Abstract —Designing efficient channel access schemes for wireless
communications without any prior knowledge about the nature of
the environment is a very challenging issue, since the channel
state distribution of the spectrum resources could be entirely or
partially stochastic or adversarial at different times and
locations. In this paper, we propose an online learning algorithm
for adaptive channel access of wireless communications in unknown
environments based on the theory of multi-armed bandit (MAB)
problems. By automatically tuning two control parameters, i.e.,
the learning rate and the exploration probability, our algorithm
finds the optimal channel access strategy and achieves almost
optimal learning performance over time in different scenarios.
The quantitative performance studies indicate a superior
throughput gain over previous solutions and the flexibility of
our algorithm in practice: it is resilient to both oblivious and
adaptive jamming attacks of different intelligence and attacking
strength, ranging from no attack to a full attack on all spectrum
resources. We conduct extensive simulations to validate our
theoretical analysis.

Index Terms —Online learning, jamming attack, stochastic and 
adversarial bandits, wireless communications, security. 

I. Introduction 

The design of channel access schemes is a pivotal problem in
wireless communications. Stimulated by the recent appearance of
smart wireless devices with adaptive and learning abilities,
modern wireless communications place very high requirements on
such schemes, especially in complex environments where accurate
instantaneous channel states can barely be acquired before
transmission and the long-term channel evolution process is
unknown (e.g., cognitive radio, smart vehicular and military
communications). Thus, it is critical for wireless devices to
learn and select the best channels to access in general unknown
wireless environments.

Many recent works have tackled the channel access problem in
unknown environments by online learning approaches, almost all of
which are formulated as the multi-armed bandit (MAB) problem [?],
owing to its inherent capability of balancing “exploitation” and
“exploration” in the selection of channels and its superior
throughput gain with finite-time optimality guarantees, e.g.,
[?]. The main goal is to find a channel access strategy that
achieves the optimal expected throughput by minimizing the
“regret” as the learning performance metric, i.e., the
performance gap between the proposed strategy and the optimal
fixed one known in hindsight, accumulated over time. Briefly
speaking, these works can be categorized into two types of MAB
models, namely, adversarial (non-stochastic) MAB [?] and
stochastic MAB [?]. Stochastic MAB assumes that the channel state
follows some unknown i.i.d. process, while adversarial MAB
assumes that the channel state can be controlled arbitrarily by
adversaries (e.g., jamming attackers), so that its distribution
is no longer i.i.d. Accordingly, the analytic approaches and
performance results of the two models are distinctly different.
It is well known that stochastic MAB and adversarial MAB have
regrets of order “logarithmic-t” [?] and “root-t” [?] over time
t, respectively. Hence, the learning performance achievable in
the stochastic MAB is much better than that in the adversarial
MAB.

A key assumption of almost all existing works is that the nature
of the environment, either stochastic or adversarial, is known
a priori. Although this largely captures the main characteristics
of wireless environments, it is too restrictive to describe
general wireless environments in practice: in many practical
wireless applications, the nature of the environment is not
restricted to either the stochastic or the adversarial type, and
it usually cannot be known in advance.

On the one hand, the application of existing models may lead to
poor learning performance when no prior knowledge about the
environment is available. Consider a wireless network deployed in
a potentially hostile environment. The number and locations of
attackers are often unknown to the wireless network. In this
scenario, certain portions of the spatially dispersed channels
may (or may not) suffer from adversarial denial-of-service
attackers, while the others are stochastically distributed. The
classic approach here is to use the adversarial MAB model [?] to
design optimal channel access strategies, because the stochastic
MAB may not be feasible in the presence of adversaries. However,
applying the adversarial MAB model to all channels leads to large
regret, since a great portion of the channels can still be
stochastically distributed. Thus, it is hard to decide which type
of MAB model to use in the first place.

On the other hand, a channel access strategy based on the
stochastic MAB model [?] faces practical implementation issues
even when it is certain that there is no long-term adversarial
behavior. In almost all wireless communication systems, commonly
seen occasional disturbances contaminate the stochastic channel
distributions. These include bursty movements of individuals,
jitter effects of electromagnetic waves, and the rare but
irregular replacement of obstacles. In this case, the channel
distribution does not follow an i.i.d. process for a small
portion of the learning period. Thus, it is not clear whether the
stochastic MAB theory is still applicable, how the contamination
affects the learning performance, and to what extent the
contamination is negligible. Therefore, the design of a unified
channel access scheme without any prior knowledge of the
operating environment is very challenging; such a scheme is
highly desirable and bears great theoretical value.

In this paper, we propose a novel adaptive multi-channel access
algorithm for wireless communications that achieves near-optimal
learning performance without any prior knowledge about the nature
of the environment, which provides the first theoretical
foundation of scheme design and performance characterization for
this challenging issue. The proposed algorithm neither needs to
distinguish between the stochastic and adversarial MAB problems
nor requires the time horizon to be known in advance. To the best
of our knowledge, ours is the first work that bridges the
stochastic and adversarial MABs in a unified framework with
promising applications in practical wireless systems.

The idea is based on the well-known EXP3 algorithm for the
non-stochastic MAB, into which we introduce a new control
parameter for the exploration probability of each channel. By
jointly controlling the learning rate and the exploration
probability, the proposed algorithm achieves almost optimal
learning performance in different regimes. When the environment
happens to be adversarial, our algorithm behaves like classic
adversarial MAB-based algorithms and attains the optimal “root-t”
regret bound of the adversarial regime. When the environment
happens to be stochastic, we show a problem-dependent
“polylogarithmic-t” regret bound, which is slightly worse than
the optimal “logarithmic-t” bound in [?]. Furthermore, we prove
that the proposed algorithm retains the “polylogarithmic-t”
regret bound in the stochastic regime as long as, on average, the
contamination over all channels does not reduce the gap $\Delta$
between the optimal and suboptimal channels by more than one
half. Note that all regret bounds are sublinear in the time
horizon, which indicates that the optimal channel access strategy
is achievable. Our main contributions are summarized as follows.

1) We categorize the features of general wireless communication
environments into four typical regimes: the adversarial regime,
the stochastic regime, the mixed adversarial and stochastic
regime, and the contaminated stochastic regime. We provide solid
theoretical results for each of them, with almost optimal regret
bounds in every regime.

2) Our proposed AUFH-EXP3++ algorithm shares the statistical
information of a channel among the different transmission
strategies that contain it, which can be regarded as a special
type of combinatorial semi-bandit^1 problem. In this scenario,
given the total number of channels n and the number of receiving
channels $k_r$, AUFH-EXP3++ achieves a regret of order
$O(k_r\sqrt{tn\ln n})$ in the adversarial regime (for the
commonly considered oblivious adversary) and a problem-dependent
polylogarithmic regret in the other stochastic regimes up to time
t. From the perspective of the parameters n and $k_r$ for
different configurations of wireless communications, AUFH-EXP3++
achieves tight regret bounds in both the adversarial setting [?]
and the stochastic setting [29]. We also study, for the first
time, the performance of our algorithm under an adaptive
adversary.

^1 The term first appears in [?]; it means that the reward of
each item within the combinatorial MAB strategy played as an arm
is revealed to the decision maker.

3) We provide a computationally efficient enhanced version of the
AUFH-EXP3++ algorithm. It has linear time and space complexity in
n and $k_r$, which indicates very good scalability and allows
implementation in large-scale wireless communication networks.

4) We conduct a wide range of numerical experiments, and the
simulation results demonstrate that all the advantages of the
AUFH-EXP3++ algorithm identified in our theoretical analysis hold
and can be realized easily in practice.

The rest of this paper is organized as follows. Related works are
discussed in Section II. Section III describes the communication
model, the problem formulation, and the four regimes. Section IV
introduces the optimal adaptive uncoordinated frequency hopping
algorithm, AUFH-EXP3++. The performance results for the different
regimes are presented in Section V, and their proofs are given in
Section VI. Section VII presents a computationally efficient
implementation of the AUFH-EXP3++ algorithm. Numerical and
simulation results are reported in Section VIII. Finally, we
conclude the paper in Section IX.

II. Related Works 

Recently, online learning-based approaches to wireless
communications and networking problems in unknown environments
have gained growing attention. Learning by repeated interaction
with the environment is usually categorized into the domain of
reinforcement learning (RL). It is worth pointing out that there
exists extensive literature on RL, which generally targets a
broader set of learning problems in Markov Decision Processes
(MDPs) [?]. Such learning algorithms guarantee optimality only
asymptotically, which cannot be relied upon in mission-critical
applications. MAB problems, however, constitute a special class
of MDPs for which the regret learning framework is generally
viewed as more effective in terms of both convergence and
computational complexity. Thus, the use of MAB models is well
justified. Works based on the stochastic MAB model often consider
stochastically distributed channels in benign environments, such
as dynamic spectrum access [?], cognitive radio networks [?],
channel monitoring in infrastructure wireless networks [?],
wireless scheduling [?], and channel access scheduling in
multi-hop wireless networks [?]. The adversarial MAB model is
applied to adversarial channel conditions, such as anti-jamming
wireless communications [?], shortest-path routing [?],
non-stochastic channel access affected by primary user activity
in cognitive radio networks [?], and power control and channel
selection problems [?].

The stochastic and adversarial MABs have co-existed in parallel
for almost two decades. Only recently did [25] attempt to bring
them together, but it did not achieve unification in a full
sense, since the algorithm relies on knowledge of the time
horizon and makes a one-time irreversible switch between the
stochastic and adversarial operation modes if the beginning of
the play exhibits adversarial behavior. The first practical
algorithm for both stochastic and adversarial bandits was
proposed in [24]. Our work brings the novel exploration parameter
$\xi_t(f)$ of [24] into our own special combinatorial
exponential-weight algorithm by exploiting the channel dependency
among different strategies.

This new framework avoids the computational inefficiency issue of
general combinatorial adversarial bandit problems indicated in
[?]. It achieves a regret bound of order $O(k_r\sqrt{tn\ln n})$,
which is only a factor of $O(\sqrt{k_r})$ off the optimal
$O(\sqrt{k_r tn\ln n})$ bound in the combinatorial adversarial
bandit setting [?]. However, we believe that the regret bound in
our framework is the optimal one for exponential-weight (e.g.,
EXP3 [?]) type algorithms that are computationally efficient.
Thus, our work also provides the first computationally efficient
combinatorial MAB algorithm for general unknown environments.
What is more surprising and encouraging, in the stochastic
regimes (including the contaminated stochastic regime), all of
our algorithms achieve a regret bound that, in terms of the
number of channels n and the number of channels $k_r$ within each
strategy, is the best result for combinatorial stochastic bandit
problems [?].
Please note that in 0, they have a regret bound of order 
in ITt), the regret bound is 0(2—1^-^); and in 
El, the regret bound is 0(2^-^^^—^). Thus, our proposed 
algorithms are order optimal with respect to n and $k_r$ in all
regimes, which indicates good scalability for general wireless
communication systems.

III. Problem Formulation 
A. System Model 

We consider two wireless devices communicating in an unknown
environment, each within the other’s transmission range. The
sender transmits data packets to the receiver synchronously over
time. The wireless environment is highly dynamic: the states of
the channels could follow different stochastic distributions and
could also suffer from different kinds of potentially adversarial
attacks, and both can vary over time slots and channel sets.
Without loss of generality (w.l.o.g.), we consider the jamming
attack as the representative adversarial model. Specifically, we
categorize the features of wireless communication environments
into four typical regimes in the following discussion.

The transceiver pair selects multiple channels to send and
receive signals over a set of n available orthogonal channels
with possibly different data rates. We do not differentiate
between channels and frequencies in our discussion. During each
time slot, the transmitter chooses $k_t$ out of n channels to
transmit data and the receiver chooses $k_r$ out of n channels to
receive data. We assume that the transmitter and receiver do not
pre-share any secrets with each other before data communication,
and that there is no feedback channel from the receiver to the
transmitter. We assume that one jammer launches attacks on the
transceiver pair over the n channels, and that the jammer has no
knowledge of the transceiver’s strategies before data
communication. The data packet rate at time slot t from the
transmitter on channel f is denoted by $g_t(f)$, with
$g_t(f) \in [0, M]$, where the constant M is the maximum data
rate over all channels.

^2 As noted, the stochastic combinatorial bandit problem does not
have this issue, as indicated in [?].

B. The Adaptive Uncoordinated Frequency Hopping Problem 

Since no secret is shared and the transceiver pair is not
informed of any adversarial event, multi-channel wireless
communication in unknown environments needs to use frequency
hopping strategies that dynamically select a subset of channels
to maximize the accumulated data rate over time. We name our
protocol the Adaptive Uncoordinated Frequency Hopping (AUFH)
protocol due to its flexibility to achieve optimal performance in
various scenarios, in contrast to the recently developed,
sophisticated UFH protocols in [?]. The receiver’s selection of a
frequency hopping strategy that maximizes the cumulative data
packet reception faces the following challenges: 1) the receiver
has no knowledge of the transmitter or of adversarial events in
the environment, so it has no good channel access strategy to
begin with; 2) the receiver should have an adaptively optimal
channel access strategy in all different situations.

We consider the AUFH problem as a sequential decision problem,
where the choice of receiving channels at each time slot is a
decision. Denote by $\{0,1\}^n$ the vector space of all n
channels. The strategy space of the transmitter is denoted as
$S_t \subset \{0,1\}^n$, of size $\binom{n}{k_t}$, and that of
the receiver as $S_r \subset \{0,1\}^n$, of size
$\binom{n}{k_r}$. If the f-th channel is selected for
transmitting or receiving data, the f-th entry of the vector
(channel access strategy) is 1, and 0 otherwise. When a jamming
attack on a subset of $k_j$ channels exists, the strategy space
of the jammer is denoted as $S_j \subset \{0,1\}^n$, of size
$\binom{n}{k_j}$. For convenience, we say that the f-th channel
is jammed if the f-th entry is 0, and not jammed if it is 1. At
each time slot, after a strategy is chosen from $S_r$, the value
of the data rate (the “reward”) $g_t(f)$ is revealed to the
receiver if and only if f is chosen as a receiving channel.
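
As a concrete illustration of the strategy space and of the feedback available to the receiver, the following sketch enumerates $S_r$ for a small instance and returns only the rewards of the selected channels. The function names and the reward values are hypothetical and chosen for illustration; they are not part of the original protocol description.

```python
from itertools import combinations
from math import comb


def strategy_space(n, k):
    """Enumerate all 0/1 vectors of length n with exactly k ones."""
    strategies = []
    for chosen in combinations(range(n), k):
        vec = [0] * n
        for f in chosen:
            vec[f] = 1
        strategies.append(tuple(vec))
    return strategies


def semi_bandit_feedback(strategy, rewards):
    """Return {channel: reward} for the selected channels only."""
    return {f: rewards[f] for f, bit in enumerate(strategy) if bit == 1}


n, k_r = 5, 2
S_r = strategy_space(n, k_r)

# Hypothetical rewards g_t(f), normalized to [0, 1] for this example.
rewards = [0.9, 0.1, 0.5, 0.0, 0.7]
observed = semi_bandit_feedback(S_r[0], rewards)
print(len(S_r), comb(n, k_r), observed)
```

Brute-force enumeration is only practical for small n, since the size of $S_r$ is $\binom{n}{k_r}$.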

Fig. 1: Multi-channel Wireless Communications in Different
Regimes: (a) Adversarial Regime; (b) Stochastic Regime; (c) Mixed
Adversarial and Stochastic Regime; (d) Contaminated Stochastic
Regime.

Formally, the frequency hopping multi-channel access game can be
formulated as a MAB problem, described as follows: at each time
slot t = 1, 2, 3, ..., the receiver, as the decision maker,
selects a strategy $I_t$ from $S_r$. The cardinality of $S_r$ is
$|S_r| = N$. A reward $g_t(f)$ is assigned to each channel
$f \in \{1, \ldots, n\}$, and the receiver only obtains the
rewards of the channels in its strategy $i \in S_r$. Note that
$I_t$ denotes the particular strategy chosen at time slot t from
the receiver’s strategy set $S_r$, and i denotes a generic
strategy in $S_r$. The total reward of a strategy i in time slot
t is $g_t(i) = \sum_{f \in i} g_t(f)$. Then, on the one hand, the
cumulative reward of strategy i up to time slot t is
$G_t(i) = \sum_{s=1}^{t} g_s(i) = \sum_{s=1}^{t} \sum_{f \in i} g_s(f)$.
On the other hand, the total reward over all the strategies
chosen by the receiver up to time slot t is
$\hat{G}_t = \sum_{s=1}^{t} g_s(I_s) = \sum_{s=1}^{t} \sum_{f \in I_s} g_s(f)$,
where the strategy $I_s$ is chosen randomly according to some
distribution over $S_r$. The performance of the algorithm is
quantified by the regret R(t), defined as the difference between
the expected reward of the best fixed strategy and the expected
number of data packets successfully received by our proposed
algorithm up to time slot t, i.e.,

$$R(t) = \max_{i \in S_r} E\{G_t(i)\} - E\{\hat{G}_t\}, \tag{1}$$


where the maximum is taken over all strategies available to the
receiver. However, if we used the gain (reward) model in the
theoretical analysis of the AUFH-based algorithm below, we would
have to apply additional smoothing to the playing distribution
$q_t(f)$ with respect to $g_t(f)$. Thus, we introduce the loss
model through the simple transformations
$\ell_t(f) = 1 - g_t(f)$ for each channel f and
$\ell_t(i) = k_r - g_t(i)$ for each strategy i. Then, we have
$L_t(i) = t k_r - G_t(i)$, where
$L_t(i) = \sum_{s=1}^{t} \ell_s(i) = \sum_{s=1}^{t} \sum_{f \in i} \ell_s(f)$,
and similarly $\hat{L}_t = t k_r - \hat{G}_t$. We use
$E_t[\cdot]$ to denote the expectation over the realizations of
all strategies, as random variables, up to round t. Therefore,
the expected regret R(t) can be rewritten as

$$
\begin{aligned}
R(t) &= E\{\hat{L}_t\} - \min_{i \in S_r} E\{L_t(i)\}
      = E\Big[\sum_{s=1}^{t} \ell_s(I_s)\Big]
      - \min_{i \in S_r} E\Big[\sum_{s=1}^{t} \ell_s(i)\Big] \\
     &= E\Big[\sum_{s=1}^{t} E_s\Big[\sum_{f \in I_s} \ell_s(f)\Big]\Big]
      - \min_{i \in S_r} \Big(E\Big[\sum_{s=1}^{t} E_s\Big[\sum_{f \in i} \ell_s(f)\Big]\Big]\Big).
\end{aligned}
\tag{2}
$$

The expectation is taken over the possible randomness of the
proposed algorithm and of the loss generation model. The goal of
the algorithm is to minimize the regret. The above definition of
regret is usually called the pseudo-regret [?], which is upper
bounded by the expected regret
$E\{\hat{L}_t - \min_{i \in S_r} L_t(i)\}$. Only when the
adversary is oblivious, i.e., it prepares the entire sequence of
loss functions $\ell_t(I_t)$ (t = 1, 2, 3, ...) in advance, does
the pseudo-regret (2) coincide with the standard expected regret
[?].
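
The regret above is defined in expectation; for a single simulated run, its realized counterpart can be computed as sketched below, in both the reward form and the loss form. The sketch assumes that rewards are normalized to [0, 1], that every played strategy contains exactly $k_r$ channels, and that a full reward table `rewards[s][f]` is available for evaluation purposes (the receiver itself never observes it). All function names are illustrative.

```python
def best_fixed_reward(rewards, k_r):
    """Cumulative reward of the best fixed strategy in hindsight.

    rewards[s][f] is g_s(f). Rewards add up over channels, so the best
    fixed k_r-subset is the k_r channels with the largest totals.
    """
    n = len(rewards[0])
    totals = [sum(row[f] for row in rewards) for f in range(n)]
    return sum(sorted(totals, reverse=True)[:k_r])


def realized_regret(rewards, played, k_r):
    """Reward form: best fixed cumulative reward minus collected reward."""
    collected = sum(rewards[s][f] for s, chans in enumerate(played) for f in chans)
    return best_fixed_reward(rewards, k_r) - collected


def realized_regret_from_losses(rewards, played, k_r):
    """Loss form, using l_s(f) = 1 - g_s(f): suffered loss minus best fixed loss."""
    n = len(rewards[0])
    loss_totals = [sum(1.0 - row[f] for row in rewards) for f in range(n)]
    best_loss = sum(sorted(loss_totals)[:k_r])
    suffered = sum(1.0 - rewards[s][f] for s, chans in enumerate(played) for f in chans)
    return suffered - best_loss


# Three slots, three channels, one receiving channel per slot.
rewards = [[1, 0, 0], [1, 0, 1], [0, 1, 1]]
played = [{1}, {1}, {0}]
r_gain = realized_regret(rewards, played, k_r=1)
r_loss = realized_regret_from_losses(rewards, played, k_r=1)
print(r_gain, r_loss)
```

Both forms agree because $L_t(i) = t k_r - G_t(i)$. Averaging the realized value over many independent runs estimates the expected regret, which, as noted above, upper-bounds the pseudo-regret.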


Note that the loss function chosen by an oblivious adversary at
time slot t is independent of the first $t-1$ time slots.
Otherwise, the adversary is called an adaptive adversary. In this
case, let us denote the decision maker’s entire sequence of
strategies up to the current time slot t by
$(I_1, \ldots, I_t)$, which we abbreviate by $I_{1,\ldots,t}$.
The expected cumulative loss suffered by the player after t
rounds is $E[\sum_{s=1}^{t} \ell_s(I_{1,\ldots,s})]$. We need to
compare it with a competitor class $C_t$, which is simply a set
of deterministic strategy sequences of length t. Intuitively, we
would like to compare the decision maker’s loss with the
cumulative loss of the best action sequence in $C_t$. In
practice, the most common way to evaluate the decision maker’s
performance is to measure its external pseudo-regret with respect
to $C_t$ [?]. Thus, the regret for an adaptive adversary is
defined as

$$R(t) = \max_{(y_1,\ldots,y_t) \in C_t} E\Big[\sum_{s=1}^{t} \big(\ell_s(I_{1,\ldots,s}) - \ell_s(I_{1,\ldots,s-1}, y_s)\big)\Big]. \tag{3}$$

This regret definition is suitable for most theoretical work on
online learning and bandit settings. If the adversary is
oblivious, $\ell_t(I_{1,\ldots,t})$ equals $\ell_t(I_t)$. With
this simplified notation, the regret in (3) becomes

$$E\Big[\sum_{s=1}^{t} \ell_s(I_s)\Big] - \min_{(y_1,\ldots,y_t) \in C_t} \sum_{s=1}^{t} \ell_s(y_s),$$

which is exactly the same as (2) with $C_t = S_r$, if we take the
expectation over all the strategy sequences $(y_1, \ldots, y_t)$.


C. The Four Regimes of Wireless Environments 

Since our algorithm does not need to know the nature of the
environment, different features of the environment will affect
its performance. We categorize them into the four typical regimes
shown in Fig. 1.

1) Adversarial Regime: In this regime, a jammer sends interfering
power or injects garbage data packets over all n channels, so
that the transceiver’s channel rewards are entirely subject to an
unrestricted jammer (see Fig. 1 (a)). Assuming the same level of
transmission power as in the stochastic regime, the data rate is
significantly reduced in the adversarial regime. Note that, as
the classic model of the well-known non-stochastic MAB problem
[?], the adversarial regime implies that the jammer launches
attacks in almost^3 every time slot. It is the most general
setting, and the other three regimes can be regarded as special
cases of it. A strategy
$i \in \arg\min_{i' \in S_r} E[L_t(i')]$ is known as a best
strategy in hindsight for the first t rounds.

Attack Model: Different attack philosophies lead to different
levels of effectiveness. We classify the various types of jammers
in the adversarial regime into the following two categories:

a) Oblivious jammer: an oblivious jammer can attack different
channels with different jamming strengths, resulting in different
data rate reductions. Its current attacking strategy is not based
on the observed past communication records. As described in [?],
an oblivious jammer can use static and random strategies to
attack wireless channels. If its attacking strategy is
time-independent (e.g., a static jammer), we can simply regard
the attacked channel as a stochastic channel with bad channel
quality. In general, the attacking strategy of an oblivious
jammer can change over time. Many other kinds of jamming, such as
partial-band jamming and sweep jamming [?], also belong to the
oblivious attack model. In brief, it is a simple attack model
that does not react to the defending algorithm, although the
attackers’ strategies can differ widely.

b) Adaptive jammer: an adaptive jammer, also called a
non-oblivious jammer, adaptively selects its jamming strength on
the targeted (sub)set of channels by utilizing its past
experience and observations of the previous communication
records. In the adversarial regime, we consider the adaptive
jammer to be very powerful: it not only knows the communication
protocol and is able to attack a subset of the data communication
channels with different levels of strength during a single time
slot, but it can also monitor all n available channels during the
same time slot. For example, the reactive jammer with the
behavior described in [?] belongs to this type. As shown in a
recent work [?], no bandit algorithm can guarantee a sublinear
regret $o(t)$ against an adaptive adversary with unbounded
memory: the adaptive adversary can mimic the decision maker
(i.e., the receiver in our work) by running the same learning
algorithm and setting the same channel access probabilities,
which leads to a linear regret. Therefore, we consider the more
practical m-memory-bounded adaptive adversary model [?], which is
constrained to choose loss functions that depend only on the
m + 1 most recent strategies.
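
To make the memory constraint concrete, the following sketch shows one hypothetical m-memory-bounded jammer. Its policy (jam the $k_j$ channels used most often in the remembered window) is an assumption made for illustration and is not the attack analyzed in this paper; the only property it is meant to exhibit is that its choice depends on at most the m + 1 most recent receiver strategies.

```python
from collections import Counter, deque


class MemoryBoundedJammer:
    """Hypothetical jammer that remembers only the last m + 1 strategies."""

    def __init__(self, m, k_j):
        # deque with maxlen silently discards strategies older than m + 1 slots.
        self.history = deque(maxlen=m + 1)
        self.k_j = k_j

    def observe(self, strategy):
        """Record the set of channels the receiver used in the latest slot."""
        self.history.append(tuple(strategy))

    def jammed_channels(self):
        """Jam the k_j channels used most often within the remembered window."""
        counts = Counter(f for s in self.history for f in s)
        # Most frequently used first; ties are broken by channel index.
        ranked = sorted(counts, key=lambda f: (-counts[f], f))
        return set(ranked[: self.k_j])


jammer = MemoryBoundedJammer(m=1, k_j=1)
for strategy in [(0, 1), (1, 2), (2, 3)]:
    jammer.observe(strategy)
target = jammer.jammed_channels()
print(target)
```

With m = 1 the first strategy (0, 1) has already been forgotten when the jammer decides, so only the last two strategies influence its choice.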

2) Stochastic Regime: In this regime, the transceiver
communicates over n stochastic channels, as shown in Fig. 1 (b).
The channel losses $\ell_t(f)$, $\forall f \in \{1, \ldots, n\}$
(obtained by converting the reward to a loss,
$\ell_t(f) = 1 - g_t(f)$), of each channel f are sampled
independently from an unknown distribution that depends on f but
not on t. We use $\mu(f) = E[\ell_t(f)]$ to denote the expected
loss of channel f. We call channel f a best channel if
$\mu(f) = \min_{f'}\{\mu(f')\}$ and a suboptimal channel
otherwise; let $f^*$ denote some best channel. Similarly, a
strategy $i \in S_r$ is a best strategy if
$\mu(i) = \min_{i'}\{\sum_{f \in i'} \mu(f)\}$ and a suboptimal
strategy otherwise; let $i^*$ denote some best strategy. For each
channel f, we define the gap $\Delta(f) = \mu(f) - \mu(f^*)$; let
$\Delta_f = \min_{f: \Delta(f) > 0}\{\Delta(f)\}$ denote the
minimal gap of channels, i.e., the gap to the second best
channel(s). Similarly, for each strategy i, we have
$\Delta(i) = \mu(i) - \mu(i^*)$; let
$\Delta_i = \min_{i: \Delta(i) > 0}\{\Delta(i)\}$ denote the
minimal gap of strategies. Let $N_t(f)$ and $N_t(i)$ be the
number of times channel f and strategy i, respectively, were
played up to time t. The regret can then be bounded in terms of
$N_t(f)$:

$$R(t) = \sum_{i} E[N_t(i)]\Delta(i) \le \sum_{f} E[N_t(f)]\Delta(f). \tag{4}$$

Note that we compute the upper bound on the regret from the
perspective of the channel set $\mathcal{K}$, which upper-bounds
the regret from the perspective of the strategy set
$\mathcal{N}$. This is because the set of strategies has size
$\binom{n}{k_r}$, which grows exponentially with respect to n,
and a strategy-level analysis does not exploit the channel
dependency among different strategies. We thus compute the upper
bound on the regret from the perspective of channels, where tight
regret bounds are achievable.

^3 Strictly speaking, according to the definition and analysis of
the contaminated stochastic regime below, when the total number
of contaminated round-channel pairs (t, f) attacked by the jammer
on each channel up to time t is, on average, significantly
greater than $t\Delta(f)/4$, the environment can be regarded as
belonging to the adversarial regime.
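
The gap quantities and the channel-based regret expression can be illustrated numerically. The sketch below is a minimal example under the assumption that the expected losses $\mu(f)$ and the play counts are known, which is never the case for the receiver; the numbers and function names are hypothetical.

```python
def channel_gaps(mu):
    """Return (a best channel, per-channel gaps Delta(f), minimal positive gap)."""
    best = min(range(len(mu)), key=lambda f: mu[f])
    gaps = [m - mu[best] for m in mu]
    positive = [g for g in gaps if g > 0]
    min_gap = min(positive) if positive else 0.0
    return best, gaps, min_gap


def best_strategy(mu, k_r):
    """Losses add over channels, so a best strategy takes the k_r smallest mu(f)."""
    return sorted(sorted(range(len(mu)), key=lambda f: mu[f])[:k_r])


def channel_regret(counts, gaps):
    """Channel-based expression sum_f N_t(f) * Delta(f) for given play counts."""
    return sum(c * g for c, g in zip(counts, gaps))


mu = [0.5, 0.25, 0.75, 0.25]           # hypothetical expected losses
best, gaps, min_gap = channel_gaps(mu)
strategy = best_strategy(mu, k_r=2)
bound = channel_regret([10, 70, 5, 15], gaps)
print(best, gaps, min_gap, strategy, bound)
```

Only plays of suboptimal channels contribute to the sum, which is why the analysis focuses on how often each suboptimal channel is selected.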

3) Mixed Adversarial and Stochastic Regime: This regime assumes
that the jammer attacks only $k_j$ out of the n channels at each
time slot. As shown in Fig. 1 (c), there is always a $k_j/n$
portion of channels that suffers from the jamming attack, while
the other $(n - k_j)/n$ portion is stochastically distributed. We
call this regime the mixed adversarial and stochastic regime.

Attack Model: For the mixed adversarial and stochastic regime, we
consider the same types of jammer as described in the adversarial
regime: the oblivious jammer (with static and random jamming) and
the adaptive jammer. The difference is that the jammer attacks
only a subset of $k_j$ channels rather than all n channels.

4) Contaminated Stochastic Regime: The definition of the
contaminated stochastic regime comes from the practical
observation that often only a few channels and time slots are
exposed to an adversary. This raises the question of whether such
an environment is still stochastic or is adversarial, which we
are able to answer. In this regime, an oblivious jammer selects
some slot-channel pairs (t, f) as “locations” to attack before
the multi-channel wireless communication starts, while the
remaining channel rewards are generated as in the stochastic
regime. We define the attacking strength parameter
$\zeta \in [0, 1/4)$. After a certain number $\tau$ of time
slots, for all $t > \tau$ the total number of contaminated
locations of each suboptimal channel up to time t is
$t\Delta(f)\zeta$, and the number of contaminated locations of
each best channel is $t\Delta_f\zeta$.

We call a contaminated stochastic regime moderately contaminated
if $\zeta$ is at most 1/4. In this case, we can prove that for
all $t > \tau$, on average over the stochasticity of the loss
sequence, the adversary can reduce the gap of every channel by at
most one half. Thus, if the attacking strength
$\zeta \in [0, 1/4]$, the environment can still be regarded as
benign and behaving stochastically (though it is contaminated).
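
The contamination budget can be checked mechanically for a simulated jammer. The sketch below treats $t\Delta(f)\zeta$ (suboptimal channels) and $t\Delta_f\zeta$ (best channels) as upper limits on the number of contaminated slots per channel; reading these quantities as upper limits is an assumption of this example, and the function name and numbers are hypothetical.

```python
def is_moderately_contaminated(counts, gaps, t, zeta):
    """Check the per-channel contamination budget at time t.

    counts[f]: number of contaminated slots on channel f up to time t.
    gaps[f]:   Delta(f); zero for best channels.
    zeta:      attacking strength; moderate contamination needs zeta <= 1/4.
    """
    if zeta > 0.25:
        return False
    positive = [g for g in gaps if g > 0]
    if not positive:
        # All channels are equally good, so there is no gap to reduce.
        return True
    min_gap = min(positive)
    for c, g in zip(counts, gaps):
        # Best channels (gap 0) are budgeted with the minimal positive gap.
        budget = t * (g if g > 0 else min_gap) * zeta
        if c > budget:
            return False
    return True


gaps = [0.25, 0.0, 0.5]
within = is_moderately_contaminated([40, 30, 90], gaps, t=1000, zeta=0.2)
beyond = is_moderately_contaminated([60, 0, 0], gaps, t=1000, zeta=0.2)
print(within, beyond)
```

In the example the budgets at t = 1000 are 50, 50, and 100 contaminated slots for the three channels, so the second attack pattern exceeds the budget on channel 0.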


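Algorithm 1 AUFH-EXP3++: An MAB-based Algorithm for AUFH

Input: n, $k_r$, t; see text for the definition of $\eta_t$ and
$\xi_t(f)$.
Initialization: Set the initial strategy and channel losses
$\tilde{L}_0(i) = 0, \forall i \in [1, N]$, and
$\tilde{\ell}_0(f) = 0, \forall f \in [1, n]$, respectively. Set
the initial strategy and channel weights
$w_0(i) = k_r, \forall i \in [1, N]$, and
$w_0(f) = 1, \forall f \in [1, n]$, respectively. The initial
total strategy weight is $W_0 = N = \binom{n}{k_r}$.
Set $\varepsilon_t(f) = \min\{\frac{1}{2n}, \beta_t, \xi_t(f)\}, \forall f \in [1, n]$.

for time slot t = 1, 2, ... do

1: The receiver selects a channel hopping strategy $I_t$ at
random according to the strategy probabilities
$p_t(i), \forall i \in [1, N]$, with $p_t(i)$ computed as
follows:

$$
p_t(i) =
\begin{cases}
\Big(1 - \sum_{f=1}^{n} \varepsilon_t(f)\Big) \dfrac{w_{t-1}(i)}{W_{t-1}} + \sum_{f \in i} \varepsilon_t(f) & \text{if } i \in C, \\
\Big(1 - \sum_{f=1}^{n} \varepsilon_t(f)\Big) \dfrac{w_{t-1}(i)}{W_{t-1}} & \text{if } i \notin C.
\end{cases}
$$

The computation yields the probability distribution over all
strategies, $p_t = (p_t(1), p_t(2), \ldots, p_t(N))$.

2: The receiver computes the probability
$q_t(f), \forall f \in [1, n]$, as

$$
q_t(f) = \sum_{i: f \in i} p_t(i)
= \Big(1 - \sum_{f'=1}^{n} \varepsilon_t(f')\Big) \frac{\sum_{i: f \in i} w_{t-1}(i)}{W_{t-1}}
+ \sum_{f \in i} \varepsilon_t(f)\, |\{i \in C : f \in i\}|.
$$

Then, the probability distribution over all channels is
$q_t = (q_t(1), q_t(2), \ldots, q_t(n))$.

3: The receiver calculates the loss of channel f,
$\ell_{t-1}(f), \forall f \in I_t$, based on the received channel
gain by using $\ell_{t-1}(f) = 1 - g_{t-1}(f)$. Compute the
estimated loss $\tilde{\ell}_t(f), \forall f \in [1, n]$, as
follows:

$$
\tilde{\ell}_t(f) =
\begin{cases}
\dfrac{\ell_t(f)}{q_t(f)} & \text{if channel } f \in I_t, \\
0 & \text{otherwise.}
\end{cases}
$$

4: The receiver updates all the weights as

$$
w_t(f) = w_{t-1}(f)\, e^{-\eta_t \tilde{\ell}_{t-1}(f)}, \qquad
w_t(i) = \prod_{f \in i} w_t(f) = w_{t-1}(i)\, e^{-\eta_t \tilde{\ell}_{t-1}(i)},
$$

where $\tilde{L}_t(f) = \tilde{L}_{t-1}(f) + \tilde{\ell}_{t-1}(f)$,
$\tilde{\ell}_{t-1}(i) = \sum_{f \in i} \tilde{\ell}_{t-1}(f)$, and
$\tilde{L}_t(i) = \tilde{L}_{t-1}(i) + \tilde{\ell}_{t-1}(i)$. The
sum of the total weights of the strategies is

$$W_t = \sum_{i \in S_r} w_t(i).$$

end for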
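
The four steps of Algorithm 1 can be sketched for a small instance by brute-force enumeration of the strategies. This is a minimal illustration under stated assumptions, not the authors' implementation: the exploration probabilities `eps` and the learning rate `eta` are passed in by the caller instead of being tuned, the exponential factor is the standard exponential-weights form applied to the current round's loss estimate, strategy weights are taken as the product of channel weights, and the covering set is assumed to be a partition of the channels so that the strategy probabilities sum to one.

```python
import math
import random
from itertools import combinations


def aufh_round(weights, strategies, cover, eps, eta, losses, rng):
    """One round of an AUFH-EXP3++-style update (brute force over strategies).

    weights:    channel weights w_{t-1}(f)
    strategies: list of tuples of channel indices (the set S_r)
    cover:      covering strategies C; assumed to partition the channels
    eps:        exploration probabilities eps_t(f), supplied by the caller
    eta:        learning rate eta_t, supplied by the caller
    losses:     loss vector l_t(f) in [0, 1]; only chosen entries are read
    """
    n = len(weights)
    w_strat = [math.prod(weights[f] for f in s) for s in strategies]
    total = sum(w_strat)
    explore = sum(eps)
    cover_set = set(cover)

    # Step 1: exponential weights mixed with exploration on the covering set.
    p = []
    for s, w in zip(strategies, w_strat):
        prob = (1.0 - explore) * w / total
        if s in cover_set:
            prob += sum(eps[f] for f in s)
        p.append(prob)
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("cover must contain every channel exactly once")

    # Step 2: channel marginals q_t(f), summing p_t(i) over strategies with f.
    q = [0.0] * n
    for s, prob in zip(strategies, p):
        for f in s:
            q[f] += prob

    # Draw I_t; losses are observed on its channels only (semi-bandit feedback).
    chosen = rng.choices(strategies, weights=p, k=1)[0]

    # Step 3: importance-weighted estimates, zero for unobserved channels.
    est = [0.0] * n
    for f in chosen:
        est[f] = losses[f] / q[f]

    # Step 4: exponential-weights update of the channel weights.
    new_weights = [w * math.exp(-eta * e) for w, e in zip(weights, est)]
    return chosen, p, q, new_weights


n, k_r = 4, 2
strategies = list(combinations(range(n), k_r))
cover = [(0, 1), (2, 3)]              # a partition of the four channels
rng = random.Random(0)
chosen, p, q, new_w = aufh_round([1.0] * n, strategies, cover,
                                 [0.05] * n, 0.1, [0.0, 1.0, 0.5, 1.0], rng)
print(chosen, [round(x, 3) for x in q])
```

Enumerating all $\binom{n}{k_r}$ strategies is only feasible for small n; the sketch is meant to show how the channel marginals $q_t(f)$ let the loss of one channel update every strategy that contains it.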


IV. The Optimal Adaptive Uncoordinated 
Frequency Hopping Algorithm 
In this section, we develop an AUFH algorithm for the receiver
side. The design philosophy is that the receiver collects and
learns the rewards of the previously chosen channels, based on
which it decides the channel access strategy for the next time
slot. The main difficulty is that the algorithm must
appropriately balance exploitation and exploration. On the one
hand, it needs to keep exploring for the best set of channels on
which to receive data packets, because the environment changes
dynamically; on the other hand, it needs to exploit the best set
of channels found so far, so that they are not under-utilized.

We describe Algorithm 1, named AUFH-EXP3++. It is a variant of the EXP3 algorithm and of [11], and its performance in the four regimes is asymptotically optimal. Our new algorithm uses the fact that when the rewards of the channels of the chosen strategy are revealed, as in step 1 of Algorithm 1, this also provides some information about the rewards of every strategy that shares channels with the chosen strategy; accordingly, step 2 projects the probabilities of all strategies that share a channel onto that channel. The conversion from rewards (gains) to losses is done to facilitate the subsequent performance analysis. During each time slot, we assign each channel a weight that is dynamically adjusted based on the channel losses revealed to the receiver, as shown in step 3. Then, in step 4, the weight of a strategy is determined by the product of the weights of all its channels.
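To make the four steps concrete, the following is a minimal sketch of one round, written by brute-force enumeration of all strategies. It is not the authors' implementation: the function name `aufh_round`, the argument layout, and the assumption that the covering set partitions the channels (so that the probabilities sum to one) are illustrative choices, and the update uses the standard importance-weighted loss estimate.

```python
import itertools
import math
import random


def aufh_round(w, eps, cover, k_r, gains, eta, rng):
    """One round of a naive, enumeration-based AUFH-EXP3++ step (illustrative).

    w     : per-channel weights w_{t-1}(f); updated in place
    eps   : per-channel exploration probabilities eps_t(f)
    cover : covering strategies as sorted tuples; assumed to partition the channels
    gains : per-channel gains in [0, 1] for this slot (only chosen ones are read)
    """
    n = len(w)
    strategies = list(itertools.combinations(range(n), k_r))
    cover = set(cover)

    # The weight of a strategy is the product of its channel weights.
    s_weight = [math.prod(w[f] for f in s) for s in strategies]
    total = sum(s_weight)
    keep = 1.0 - sum(eps)

    # Step 1: exponential-weights distribution mixed with exploration on the cover.
    p = [keep * sw / total + (sum(eps[f] for f in s) if s in cover else 0.0)
         for s, sw in zip(strategies, s_weight)]
    chosen = rng.choices(strategies, weights=p, k=1)[0]

    # Step 2: channel marginal q_t(f) is the total probability of strategies containing f.
    q = [sum(pi for s, pi in zip(strategies, p) if f in s) for f in range(n)]

    # Steps 3-4: importance-weighted loss estimate, then multiplicative weight update.
    for f in chosen:
        est_loss = (1.0 - gains[f]) / q[f]
        w[f] *= math.exp(-eta * est_loss)
    return chosen, p, q
```

Enumeration costs on the order of $\binom{n}{k_r}$ per round, so this form is only useful for checking small instances.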

Compared to [11], which targets only secure wireless communications, our algorithm has two control parameters: the learning rate $\eta_t$ and the exploration parameters $\xi_t(f)$ for each channel $f$, whereas the algorithm in [11] does not use the parameters $\xi_t(f)$ to detect the other regimes of the environment. The key innovation here is that we use advanced martingale concentration inequalities (i.e., Lemma 8) to detect i.i.d., contaminated, and non-i.i.d. behaviors without knowledge of the nature of the environment, and the exploration parameter $\xi_t(f)$ is tuned individually for each channel depending on past observations.

Let $N$ denote the total number of strategies at the receiver side. A covering set of strategies $\mathcal{C}$ is defined to ensure that each channel is sampled sufficiently often. It has the property that, for each channel $f$, there is a strategy $i \in \mathcal{C}$ such that $f \in i$. Since there are only $n$ channels and each strategy includes $k_r$ channels, we have $|\mathcal{C}| = \lceil n/k_r \rceil$. The value $\sum_{f \in i} \varepsilon_t(f)$ is the randomized exploration probability for each strategy $i \in \mathcal{C}$, which is the sum of the exploration probabilities $\varepsilon_t(f)$ of the channels $f$ that belong to strategy $i$. The introduction of $\sum_{f \in i} \varepsilon_t(f)$ ensures that $p_t(i) \ge \sum_{f \in i} \varepsilon_t(f)$ for $i \in \mathcal{C}$, so that $p_t$ is a mixture of the exponentially weighted average distribution and the uniform distribution [?] over each strategy.
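One simple way to obtain a covering set of size $\lceil n/k_r \rceil$ is to cut the channel indices into consecutive blocks. The sketch below is only an illustration of the covering property; the helper name `build_cover` and the padding rule for the last block are assumptions, not a construction taken from the paper.

```python
def build_cover(n, k_r):
    """Return ceil(n / k_r) strategies (sorted tuples) that together contain every channel.

    Channels are 0-indexed. When k_r does not divide n, the last block is padded
    with the lowest-indexed channels so that every strategy has exactly k_r channels.
    """
    if not 1 <= k_r <= n:
        raise ValueError("need 1 <= k_r <= n")
    cover = []
    for start in range(0, n, k_r):
        block = list(range(start, min(start + k_r, n)))
        pad = 0
        while len(block) < k_r:      # only the last block can be short
            block.append(pad)
            pad += 1
        cover.append(tuple(sorted(block)))
    return cover
```

When $k_r$ divides $n$ the blocks are disjoint, which is the case in which each channel belongs to exactly one covering strategy.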

In the following discussion, the learning rate $\eta_t$ is sufficient to control the regret of AUFH-EXP3++ in the adversarial regime, regardless of the choice of the exploration parameter $\xi_t(f)$. The exploration parameter $\xi_t(f)$ is sufficient to control the regret of AUFH-EXP3++ in the stochastic regimes regardless of the choice of $\eta_t$, as long as $\eta_t \ge \beta_t$. To run the AUFH-EXP3++ algorithm without knowing the nature of the environment, we can apply the two control levers simultaneously by setting $\eta_t = \beta_t$ and using the control parameter $\xi_t(f)$ in the stochastic regimes, such that the algorithm achieves the optimal "root-t" regret in the adversarial regime and almost optimal "logarithmic-t" regret in the stochastic regime (though with a suboptimal power in the logarithm).
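The interplay of the two levers can be illustrated numerically. The sketch below assumes $\beta_t = \frac{1}{2}\sqrt{\ln n/(tn)}$ and the cap $\frac{1}{2n}$ (the standard EXP3++ choices, used here as assumptions), and a gap-driven tuning of the form $\xi_t(f) = c(\ln t)^2/(t\hat{\Delta}^2)$; the function names are hypothetical.

```python
import math


def beta(t, n):
    """Assumed value of beta_t: 0.5 * sqrt(ln n / (t n))."""
    return 0.5 * math.sqrt(math.log(n) / (t * n))


def xi_from_gap(t, gap_hat, c=18.0):
    """Hypothetical gap-driven exploration parameter c (ln t)^2 / (t gap_hat^2)."""
    if gap_hat <= 0.0:
        return math.inf  # no evidence of a gap yet: the other two caps decide
    return c * math.log(t) ** 2 / (t * gap_hat ** 2)


def exploration_rate(t, n, xi):
    """eps_t(f) = min{1/(2n), beta_t, xi_t(f)} for a single channel."""
    return min(1.0 / (2 * n), beta(t, n), xi)
```

Early on the $\beta_t$ term is the binding one, which is what protects the adversarial guarantee; for channels with a clearly positive estimated gap the $\xi_t(f)$ term eventually becomes smaller and exploration of those channels is reduced.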

V. Performance Results in Different Regimes 

We analyze the regret performance of our proposed AUFH- 
EXP3++ algorithm in different regimes in the following 
section. W.l.o.g., we normalize M = 1 in all our results 
to facilitate clear comparisons with regret bounds of others’ 
works. 

A. Adversarial Regime 

We first show that tuning $\eta_t$ is sufficient to control the regret of AUFH-EXP3++ in the adversarial regime, which is a general result that holds for all other regimes.

Theorem 1. Under the oblivious jamming attack, no matter how the status of the channels changes (potentially in an adversarial manner), for $\eta_t = \beta_t$ and any $\xi_t(f) \ge 0$, the regret of the AUFH-EXP3++ algorithm for any $t$ satisfies:

\[ R(t) \le 4 k_r \sqrt{t n \ln n}. \]

Theorem 2. Under the $m$-memory-bounded adaptive jamming attack, no matter how the status of the channels changes (potentially in an adversarial manner), for $\eta_t = \beta_t$ and any $\xi_t(f) \ge 0$, the regret of the AUFH-EXP3++ algorithm for any $t$ is upper bounded by:

\[ R(t) \le (m+1)\big(4 k_r \sqrt{n \ln n}\big)^{2/3} t^{2/3} + o\big(t^{2/3}\big). \]
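To get a feel for the two rates, the bounds can be evaluated numerically. The sketch below reads Theorem 1 as $4k_r\sqrt{tn\ln n}$ and the leading term of Theorem 2 as $(m+1)(4k_r\sqrt{n\ln n})^{2/3}t^{2/3}$; the function names are illustrative and the lower-order $o(t^{2/3})$ term is ignored.

```python
import math


def oblivious_bound(t, n, k_r):
    """Leading regret bound under oblivious jamming: 4 k_r sqrt(t n ln n)."""
    return 4.0 * k_r * math.sqrt(t * n * math.log(n))


def adaptive_bound(t, n, k_r, m):
    """Leading term under m-memory-bounded jamming: (m+1) (4 k_r sqrt(n ln n))^(2/3) t^(2/3)."""
    c = 4.0 * k_r * math.sqrt(n * math.log(n))
    return (m + 1) * c ** (2.0 / 3.0) * t ** (2.0 / 3.0)
```

Quadrupling $t$ doubles the first bound, while multiplying $t$ by eight quadruples the second; both are sublinear in $t$, so the per-slot regret vanishes in either case, only more slowly against the adaptive jammer.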


performed in the background for each channel $f$, starting from the beginning of the run of the algorithm, i.e.,

\[ \hat{\Delta}_t(f) = \min\Big\{1, \frac{1}{t}\Big(\tilde{L}_t(f) - \min_{f'} \tilde{L}_t(f')\Big)\Big\}. \tag{4} \]

This is the first algorithm that can be used in many real-world applications.
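The empirical gap in (4) only needs the cumulative estimated losses that the algorithm already maintains. A minimal sketch, with a hypothetical function name and 0-indexed channels:

```python
def empirical_gaps(cum_est_loss, t):
    """Empirical gap per channel: min{1, (L_t(f) - min_f' L_t(f')) / t}.

    cum_est_loss : list of cumulative estimated losses, one entry per channel
    t            : number of elapsed time slots (t >= 1)
    """
    best = min(cum_est_loss)
    return [min(1.0, (loss - best) / t) for loss in cum_est_loss]
```

For example, with cumulative estimated losses `[10.0, 12.0, 40.0]` after `t = 20` slots, the estimates are `0.0` for the empirically best channel, `0.1` for the second, and `1.0` (clipped) for the third.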


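Theorem 4. Let $c \ge 18$ and $\eta_t \ge \beta_t$. Let $t^*$ be the minimal integer that satisfies $t^* \ge \frac{4c^2 n \ln(t^*)^4}{\ln(n)}$, and let $t^*(f) = \max\big\{t^*, \lceil e^{1/\Delta(f)^2} \rceil\big\}$ and $t^* = \max_{f \in \{1,\ldots,n\}} t^*(f)$. The regret of the AUFH-EXP3++ algorithm with $\xi_t(f) = \frac{c(\ln t)^2}{t\hat{\Delta}_{t-1}(f)^2}$, termed AUFH-EXP3++$^{\text{AVG}}$, in the stochastic regime satisfies:

\[ R(t) \le \sum_{f=1,\,\Delta(f)>0}^{n} O\Big(\frac{k_r \ln(t)^3}{\Delta(f)}\Big) + \sum_{f=1,\,\Delta(f)>0}^{n} \Delta(f)\, t^*(f) = O\Big(\frac{n k_r \ln(t)^3}{\Delta}\Big) + n t^*. \]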

From the theorem, we see that in this more practical case the regret is worse by another factor of $\ln(t)$ compared to the idealistic case. Also, the additive constant $t^*$ in this theorem can be very large. However, our experimental results show that a minor modification of this algorithm performs comparably to CombUCB1 [29] in the stochastic regime.


C. Mixed Adversarial and Stochastic Regime 


B. Stochastic Regime 

Now we show that for any $\eta_t \ge \beta_t$, tuning the exploration parameters $\xi_t(f)$ is sufficient to control the regret of the algorithm in the stochastic regime. We consider several ways of tuning the exploration parameters $\xi_t(f)$ for different practical implementation considerations, which lead to different regret performance of AUFH-EXP3++. We begin with the idealistic assumption that the gaps $\Delta(f)$, $\forall f \in [1,n]$, are known, just to give an idea of the best result we can obtain and of the general idea behind all our proofs.

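Theorem 3. Assume that the gaps $\Delta(f)$, $\forall f \in [1,n]$, are known. Let $t^*$ be the minimal integer that satisfies $t^*(f) \ge \frac{4c^2 n \ln(t^*\Delta(f)^2)^2}{\Delta(f)^4 \ln(n)}$. For any choice of $\eta_t \ge \beta_t$ and any $c \ge 18$, the regret of the AUFH-EXP3++ algorithm with $\xi_t(f) = \frac{c\ln(t\Delta(f)^2)}{t\Delta(f)^2}$ in the stochastic regime satisfies:

\[ R(t) \le \sum_{f=1,\,\Delta(f)>0}^{n} O\Big(\frac{k_r \ln(t)^2}{\Delta(f)}\Big) + \sum_{f=1,\,\Delta(f)>0}^{n} \Delta(f)\, t^*(f) = O\Big(\frac{k_r n \ln(t)^2}{\Delta}\Big) + \sum_{f=1,\,\Delta(f)>0}^{n} \tilde{O}\Big(\frac{n}{\Delta(f)^3}\Big). \]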


From the upper bound, we note that the leading constants $k_r$ and $n$ are optimal and tight, as indicated for the CombUCB1 [29] algorithm. However, the regret is worse by a factor of $\ln(t)$ than the optimal "logarithmic" regret in [?], [?].

1) A Practical Implementation by Estimating the Gap:
Because the gaps $\Delta(f)$, $\forall f \in [1,n]$, cannot be known in advance of running the algorithm, we next show a more practical result that uses the empirical gap as an estimate of the true gap. The estimation process can be


The mixed adversarial and stochastic regime can be regarded as a special case of mixing adversarial and stochastic regimes. Since there is always a jammer randomly attacking $k_j$ channels constantly over time, we have the following theorem for the AUFH-EXP3++$^{\text{AVG}}$ algorithm, which gives a much more refined regret bound than the general regret bound in the adversarial regime.


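Theorem 5. Let $c \ge 18$ and $\eta_t \ge \beta_t$. Let $t^*$ be the minimal integer that satisfies $t^* \ge \frac{4c^2 n \ln(t^*)^4}{\ln(n)}$, and let $t^*(f) = \max\big\{t^*, \lceil e^{1/\Delta(f)^2} \rceil\big\}$ and $t^* = \max_{f \in \{1,\ldots,n\}} t^*(f)$. The regret of the AUFH-EXP3++ algorithm with $\xi_t(f) = \frac{c(\ln t)^2}{t\hat{\Delta}_{t-1}(f)^2}$, termed AUFH-EXP3++$^{\text{AVG}}$, under oblivious jamming attack in the mixed stochastic and adversarial regime satisfies:

\[ R(t) \le \sum_{f=1,\,\Delta(f)>0}^{n-k_r} O\Big(\frac{k_r \ln(t)^3}{\Delta(f)}\Big) + \sum_{f=1,\,\Delta(f)>0}^{n-k_r} \Delta(f)\, t^*(f) + 4 k_j \sqrt{t n \ln n} = O\Big(\frac{n k_r \ln(t)^3}{\Delta}\Big) + n t^* + O\big(k_j \sqrt{t n \ln n}\big). \]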

Note that the result in Theorem 5 gives better regret performance than the results obtained by adversarial MAB, as shown in Theorem 1, and by the anti-jamming algorithm in [11].


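Theorem 6. Let $c \ge 18$ and $\eta_t \ge \beta_t$. Let $t^*$ be the minimal integer that satisfies $t^* \ge \frac{4c^2 n \ln(t^*)^4}{\ln(n)}$, and let $t^*(f) = \max\big\{t^*, \lceil e^{1/\Delta(f)^2} \rceil\big\}$ and $t^* = \max_{f \in \{1,\ldots,n\}} t^*(f)$. The regret of the AUFH-EXP3++ algorithm with $\xi_t(f) = \frac{c(\ln t)^2}{t\hat{\Delta}_{t-1}(f)^2}$, termed AUFH-EXP3++$^{\text{AVG}}$, under $m$-memory-bounded adaptive jamming attack in the mixed stochastic and adversarial regime satisfies:

\[ R(t) \le \sum_{f=1,\,\Delta(f)>0}^{n-k_r} O\Big(\frac{k_r \ln(t)^3}{\Delta(f)}\Big) + \sum_{f=1,\,\Delta(f)>0}^{n-k_r} \Delta(f)\, t^*(f) + (m+1)\big(4 k_j \sqrt{n \ln n}\big)^{2/3} t^{2/3} + o\big(t^{2/3}\big) \]
\[ = O\Big(\frac{n k_r \ln(t)^3}{\Delta}\Big) + n t^* + O\Big(\big(k_j \sqrt{n \ln n}\big)^{2/3} t^{2/3}\Big). \]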

The result in Theorem 6 provides the first quantitative regret bound under adaptive jamming attack, whereas the related work [11], with a similar adversary model and the same communication scenario, provided only simulation results.

D. Contaminated stochastic regime 

We show that the algorithm AUFH-EXP3++$^{\text{AVG}}$ can still retain "polylogarithmic-t" regret in the contaminated stochastic regime, with a potentially large leading constant. The following is the result for the moderately contaminated stochastic regime.

Theorem 7. Under the setting of all parameters given in Theorem 3, for $t^*(f) = \max\big\{t^*, \lceil e^{4/\Delta(f)^2} \rceil\big\}$, where $t^*$ is defined as before and $t_3^* = \max_{f \in \{1,\ldots,n\}} t^*(f)$, and the attacking strength parameter $\zeta \in [0, 1/2)$, the regret of the AUFH-EXP3++ algorithm in the contaminated stochastic regime that is contaminated after $\tau$ steps satisfies:

\[ R(t) \le \sum_{f=1,\,\Delta(f)>0}^{n} O\Big(\frac{k_r \ln(t)^3}{(1-2\zeta)\Delta(f)}\Big) + \sum_{f=1,\,\Delta(f)>0}^{n} \Delta(f) \max\{t^*(f), \tau\} = O\Big(\frac{n k_r \ln(t)^3}{(1-2\zeta)\Delta}\Big) + n t_3^*. \]
If $\zeta \in (1/4, 1/2)$, the leading factor $1/(1-2\zeta)$ is very large and the regime is severely contaminated. In that case, the obtained regret bound is not very meaningful, and it could be much worse than the regret in the adversarial regime for both oblivious and adaptive adversaries.

VI. Proofs of Regrets in Different Regimes

We prove the theorems of the performance results from the 
previous section in the order they were presented. 

A. The Adversarial Regimes 

The proof of Theorem 1 borrows some of the analysis of EXP3 for the loss model in [?]. However, the introduction of the new mixing exploration parameter and the channel/frequency dependency, which makes this a special type of combinatorial MAB problem in the loss model, make the proof a non-trivial task, and we prove it for the first time.

Proof of Theorem 1. 

Proof: Note first that the following equalities can 
be easily verified: E^, 

£((f),E,- 




A{i) = = 

and 


= N. 


Then, we can immediately rewrite R{t) and have 


R{t) = Et 


t 

E 

.S=l 


Eu 


t 




S=1 


The key step here is to consider the expectation of the 
cumulative losses Itii) in the sense of distribution i ^ pt- Let 
^t(f) = J2f£i^t{f)- However, because of the mixing terms 
of pt, we need to introduce a few more notations. Let u = 


(Efei E/e. Eyeld be 


iec 


i^C 


the distribution over all the strategies. Let ujt = et{f) 

be the distribution induced by AUFH-EXP3-H- at the time t 
without mixing. Then we have: 

Ei„.p,4(f) = (1 - f ^sif))^i~u:Jsii) + SsA^ir^uLii) 

= (1 - E/ es(/))(^ exp(-?7s(4(f) 

(5) 

_EEiiEE exp(-?7s4(z))) 

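Recall that for all the strategies, we have the distribution $\omega_t = (\omega_t(1), \ldots, \omega_t(N))$ with

\[ \omega_t(i) = \frac{\exp(-\eta_t \tilde{L}_{t-1}(i))}{\sum_{j=1}^{N} \exp(-\eta_t \tilde{L}_{t-1}(j))}, \tag{6} \]

and for all the channels, we have the distribution $\omega_{t,f} = (\omega_{t,f}(1), \ldots, \omega_{t,f}(n))$ with

\[ \omega_{t,f}(f') = \frac{\sum_{i: f' \in i} \exp(-\eta_t \tilde{L}_{t-1}(i))}{\sum_{j=1}^{N} \exp(-\eta_t \tilde{L}_{t-1}(j))}. \tag{7} \]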


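In the second step, we use the inequalities $\ln x \le x - 1$ and $\exp(-x) - 1 + x \le x^2/2$, for all $x \ge 0$, to obtain:

\[ \begin{aligned} \ln \mathbb{E}_{i \sim \omega_s} \exp\big(-\eta_s(\tilde{\ell}_s(i) - \mathbb{E}_{j \sim \omega_s} \tilde{\ell}_s(j))\big) &= \ln \mathbb{E}_{i \sim \omega_s} \exp(-\eta_s \tilde{\ell}_s(i)) + \eta_s \mathbb{E}_{j \sim \omega_s} \tilde{\ell}_s(j) \\ &\le \mathbb{E}_{i \sim \omega_s}\big(\exp(-\eta_s \tilde{\ell}_s(i)) - 1 + \eta_s \tilde{\ell}_s(i)\big) \\ &\le \mathbb{E}_{i \sim \omega_s} \frac{\eta_s^2 \tilde{\ell}_s(i)^2}{2}. \end{aligned} \tag{8} \]

Moreover, taking expectations over all random strategies of the losses $\tilde{\ell}_s(i)^2$, we have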


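\[ \begin{aligned} \mathbb{E}_t\big[\mathbb{E}_{i \sim \omega_s} \tilde{\ell}_s(i)^2\big] &= \mathbb{E}_t\Big[\sum_{i=1}^{N} \omega_s(i) \Big(\sum_{f \in i} \tilde{\ell}_s(f)\Big)^2\Big] \\ &\le \mathbb{E}_t\Big[\sum_{i=1}^{N} \omega_s(i)\, k_r \sum_{f \in i} \tilde{\ell}_s(f)^2\Big] \\ &= k_r \mathbb{E}_t\Big[\sum_{f'=1}^{n} \tilde{\ell}_s(f')^2 \sum_{i \in S_r: f' \in i} \omega_s(i)\Big] \\ &= k_r \mathbb{E}_t\Big[\sum_{f'=1}^{n} \tilde{\ell}_s(f')^2\, \omega_{s,f}(f')\Big] \\ &\le k_r \mathbb{E}_t\Big[\sum_{f'=1}^{n} \frac{\omega_{s,f}(f')}{q_s(f')}\Big] \\ &= k_r \mathbb{E}_t\Big[\sum_{f'=1}^{n} \frac{\omega_{s,f}(f')}{\big(1 - \sum_{f} \varepsilon_s(f)\big)\, \omega_{s,f}(f') + \sum_{f \in i} \varepsilon_s(f)\, |\{i \in \mathcal{C} : f \in i\}|}\Big] \\ &\le 2 k_r n, \end{aligned} \tag{9} \]

where the last inequality follows from the fact that $\big(1 - \sum_{f} \varepsilon_s(f)\big) \ge \frac{1}{2}$ by the definition of $\varepsilon_s(f)$.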

In the third step, note that $\tilde{L}_0(i) = 0$. Let $\Phi_t(\eta) = \frac{1}{\eta} \ln \frac{1}{N} \sum_{i=1}^{N} \exp(-\eta \tilde{L}_t(i))$ and $\Phi_0(\eta) = 0$. The second term in (5) can be bounded by using the same technique as in [?] (pages 26-28). Let us substitute inequality (9) into (8), then substitute (8) into equation (5), sum over $t$, and take the expectation over all random strategies of losses up to time $t$; we obtain


Et 


E A^pJsit) 

+Et 


< fct-nE Vs + ^ -fEEi-«4(*) 

s—1 s—1 


t-1 

E ^siVs+l) - ^siVs) 

S=1 


+ E E7,..p/«(f). 

S = 1 
Then, we get

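\[ \begin{aligned} R(t) &= \mathbb{E}_t \sum_{s=1}^{t} \mathbb{E}_{i \sim p_s} \tilde{\ell}_s(i) - \mathbb{E}_t \sum_{s=1}^{t} \mathbb{E}_{I_s \sim p_s} \tilde{\ell}_s(i) \\ &\le k_r n \sum_{s=1}^{t} \eta_s + \frac{\ln N}{\eta_t} + \mathbb{E}_t \sum_{s=1}^{t} \mathbb{E}_{i \sim u} \tilde{\ell}_s(i) \\ &\overset{(a)}{\le} k_r n \sum_{s=1}^{t} \eta_s + \frac{\ln N}{\eta_t} + k_r \sum_{s=1}^{t} \sum_{f=1}^{n} \varepsilon_s(f) \\ &\overset{(b)}{\le} 2 k_r n \sum_{s=1}^{t} \eta_s + \frac{\ln N}{\eta_t} \\ &\overset{(c)}{\le} 2 k_r n \sum_{s=1}^{t} \eta_s + \frac{k_r \ln n}{\eta_t}. \end{aligned} \]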

Note that the inequality (a) holds by setting $\tilde{\ell}_s(i) = k_r$, $\forall i, s$, so that the upper bound is $k_r \sum_{s=1}^{t} \sum_{f=1}^{n} \varepsilon_s(f)$. The inequality (b) holds because, for every time slot $t$, $\eta_t \ge \varepsilon_t(f)$. The inequality (c) is due to the fact that $N \le n^{k_r}$. Setting $\eta_t = \beta_t$, we prove the theorem.
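The final substitution can be spelled out. The short derivation below is a sketch that assumes $\beta_t = \frac{1}{2}\sqrt{\frac{\ln n}{tn}}$ (the standard EXP3++ tuning, taken here as an assumption) and starts from the bound $R(t) \le 2k_r n\sum_{s=1}^{t}\eta_s + \frac{k_r\ln n}{\eta_t}$ obtained after inequality (c), using $\sum_{s=1}^{t} s^{-1/2} \le 2\sqrt{t}$:

```latex
\begin{aligned}
\sum_{s=1}^{t}\eta_s
  &= \frac{1}{2}\sqrt{\frac{\ln n}{n}}\sum_{s=1}^{t}\frac{1}{\sqrt{s}}
   \le \frac{1}{2}\sqrt{\frac{\ln n}{n}}\cdot 2\sqrt{t}
   = \sqrt{\frac{t\ln n}{n}}, \\
2k_r n\sum_{s=1}^{t}\eta_s
  &\le 2k_r\sqrt{tn\ln n}, \\
\frac{k_r\ln n}{\eta_t}
  &= 2k_r\ln n\,\sqrt{\frac{tn}{\ln n}}
   = 2k_r\sqrt{tn\ln n}, \\
R(t) &\le 2k_r\sqrt{tn\ln n} + 2k_r\sqrt{tn\ln n} = 4k_r\sqrt{tn\ln n}.
\end{aligned}
```

Under this assumption the two terms contribute equally, which is the usual sign that the learning rate balances the stability and the $\ln N/\eta_t$ terms.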


Proof of Theorem 2. 

Proof: To defend against the $m$-memory-bounded adaptive adversary, we adopt the idea of the mini-batch protocol proposed in [?]. We define a new algorithm by wrapping AUFH-EXP3++ with a mini-batching loop [11]. We specify a batch size $\tau$ and name the new algorithm AUFH-EXP3++$_\tau$. The idea is to group the overall time slots $1, \ldots, t$ into consecutive and disjoint mini-batches of size $\tau$. Viewing one single mini-batch as a round (time slot), we can use the average loss suffered during that mini-batch to feed the original AUFH-EXP3++. Note that our new algorithm does not need to know $m$, which only appears as a constant in Theorem 2. So our new AUFH-EXP3++$_\tau$ algorithm still runs in an adaptive way without any prior knowledge about the environment. If we set the batch size $\tau = (4k_r\sqrt{n \ln n})^{-2/3}\, t^{1/3}$ in Theorem 2 of [?], we get the regret upper bound in our Theorem 2. ■
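The mini-batching loop is easy to express as a wrapper around any learner. The sketch below assumes a hypothetical base-learner interface with `select()` and `update(action, loss)` methods; neither the class name `MiniBatchWrapper` nor this interface comes from the paper.

```python
class MiniBatchWrapper:
    """Play the base learner's action for tau consecutive slots, then feed back
    the average loss of the batch as if it were a single round."""

    def __init__(self, base, tau):
        if tau < 1:
            raise ValueError("tau must be a positive integer")
        self.base = base
        self.tau = tau
        self._action = None
        self._losses = []

    def select(self):
        # Query the base learner only at the start of a new mini-batch.
        if self._action is None:
            self._action = self.base.select()
        return self._action

    def update(self, loss):
        # Buffer per-slot losses; forward their average when the batch is full.
        self._losses.append(loss)
        if len(self._losses) == self.tau:
            self.base.update(self._action, sum(self._losses) / self.tau)
            self._action = None
            self._losses = []
```

Holding the action fixed within a batch is what limits how much a memory-bounded adversary can react to the learner: only the first $m$ slots of each batch can be affected by the previous batch's action.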


Lemma 10. Let {£t{f)}'^i be non-increasing deterministic 
sequences, such that £((/) < £*(/) with probability 1 and 
£t(/) < £*(/*) for all t and /. Define vt{f) = ELi 
and define the event £( 

tA{f) - {Uin - Ltif)) 

< \/2(^'t(/) + 

Then for any positive sequence 6i, & 2 , and any t* >2 the 
number of times channel / is played by AUFH-EXP3-H- up 
to round t is bounded as: 


E[iVt(/)] <(t*-l)+Y: e-^‘+kr E £.(/)!{£/} 

s—t* s—t* * 

t 


where _ 

h,{f) = tAif)- 




kr-S 






Proof: Note that the elements of the martingale difference 
sequence {A(/) - (ft(/) - ^ by max{A(/) -f 

W*)} = - 

1 /4, we can simplify the upper bound by using ^ 

&(/*) 

We further note that 

EE, \{A{f)-{Uf)-un)f 

s=l I- 

< EE, \{l{f)-l{f*)f 


S = 1 

t 


E (e. 

S = 1 ^ 


iUfY 


■E, 


{ur 


< 

(a) 

< 


s=l ^ ' 

S i^krCsU) kr-esif*)'} 


t 


< E {k^ijf) + krelir)) ~ 

S—l ^ ® s / 


B. The Stochastic Regime 


Our proofs are based on the following form of Bernstein's inequality, with a minor improvement as shown in [24].

Lemma 8. (Bernstein's inequality for martingales). Let $X_1, \ldots, X_m$ be a martingale difference sequence with respect to the filtration $\mathcal{F} = (\mathcal{F}_k)_{1 \le k \le m}$ and let $Y_k = \sum_{j=1}^{k} X_j$ be the associated martingale. Assume that there exist positive numbers $\nu$ and $c$, such that $X_j \le c$ for all $j$ with probability 1 and $\sum_{k=1}^{m} \mathbb{E}\big[(X_k)^2 \mid \mathcal{F}_{k-1}\big] \le \nu$ with probability 1. Then for all $b > 0$:

\[ P\Big[Y_m > \sqrt{2\nu b} + \frac{cb}{3}\Big] \le e^{-b}. \]

We also need the following technical lemma, whose proof can be found in [?].

Lemma 9. For any $c > 0$, we have $\sum_{t=0}^{\infty} e^{-c\sqrt{t}} = O\big(\frac{2}{c^2}\big)$.

To obtain the tight regret performance for AUFH-EXP3++, we need to study and estimate the number of times each channel is selected up to time $t$, i.e., $N_t(f)$. We summarize it in the following lemma.


with probability 1. The above inequality (a) is due to the fact that $q_t(f) \ge \sum_{f \in i} \varepsilon_t(f)\, |\{i \in \mathcal{C} : f \in i\}|$. Since each $f$ only belongs to one of the covering strategies $i \in \mathcal{C}$, $|\{i \in \mathcal{C} : f \in i\}|$ equals 1 at time slot $t$ if channel $f$ is selected. Thus, $q_t(f) \ge \sum_{f \in i} \varepsilon_t(f) = k_r \varepsilon_t(f)$.

Let $\bar{\mathcal{E}}_t^f$ denote the complement of the event $\mathcal{E}_t^f$. Then by Bernstein's inequality $P[\bar{\mathcal{E}}_t^f] \le e^{-b_t}$. The number of times the channel $f$ is selected up to round $t$ is bounded as:

nNtif)] = E = /] 

S=1 

= E = f\£Li]P[£Li] 

s=l _ _ 

+P[A, = 

< E P[41, =r+PfCT] 

< j:nAs = f\£Li]t^sf 
We further upper bound $P[A_s = f \mid \mathcal{E}_{s-1}^f]$ as follows:

]P[^s = 

< (wt(/) + fc^£s(/))l{£/_j 




S —1 (0 > 


VKs- 

Y2- 


{ff-ii 


-Vt^i 


7Tor)i 


{ff-ii 


= iKssU) + 

<(fcres(/) 


Eili e 


(&) 


{ff-J 


< (fcr£s(/) + 


(c) 


{ff-il 


The above inequality (a) is due to the fact that channel $f$ only belongs to one selected strategy $i$ in slot $t-1$; inequality (b) holds because the cumulative regret of each strategy is greater than the cumulative regret of each channel that belongs to the strategy; and in the last inequality (c) we used the fact that $\underline{\varepsilon}_t(f)$ is a non-increasing sequence, so that $\nu_t(f) \le \frac{t}{\underline{\varepsilon}_t(f)}$. Substitution of this result back into the computation of $\mathbb{E}[N_t(f)]$ completes the proof. ■

Proof of Theorem 3. 

Proof: The proof is based on Lemma 10. Let ht = 
ln{tA{fY) and £((/) = St{f). For any c > 18 and any 
t > t*, where t* is the minimal integer for which t* > 


4Ynln {PAjfy 


A(/j^ln(n) 

ht{f) = tA{f) - 


, we have 




> tA{f) - 2 


tbt 


krEt(f) 

(i+s^ 


(3 + ^)^* 

3et(/*) 


kret(f) 3et(f) 
2 (1+^0 ' 


The above inequality (a) is due to the fact that (1 — ^ — 

^ is an increasing function with respect to kr{kr >1). 

Plus, as indicated in work EO), by a bit more sophisticated 
bounding c can be made almost as small as 2 in our case. By 
substitution of the lower bound on ht{f) into Lemma 10, we 
have 


nNtif)]<t* + ^ + k 


< k. 


c In (t) 


- "'’'^(7?" acTF ^''A(7F 


A(/f 

‘"(0 _L o{ 


+ E 

S=1 


^(«-l)in(n) 


)+t* 


where we used lemma 3 to bound the sum of the exponents. 
In addition, please note that t* is of the order 0{ )■ 


Proof of Theorem 4. 

Proof: The proof is based on the similar idea of Theorem 
2 and Lemma 10. Note that by our dehnition At{f) < 1 and 
the sequence £^{f) = £( = min{^,/3t, £Al£L} satishes the 
condition of Lemma 10. Note that when Pt > j e., for 

t large enough such that t > we have 

Let bt = ln{t) and let t* be large enough, so that for all t > t* 
we have t > " ^nd t > . With these parameters 

and conditions on hand, we are going to bound the rest of 
the three terms in the bound on E[Ai(/)] in Lemma 10. The 


upper bound of X]s=f ® obtain. For bounding 

kr E‘=f we 


note that f/ holds and we have 


>7- 


(o) / 

A(/)( 


2t 


Ltif)) 

>jiLtin- 


/ tbt (l + 'E 

f) 2^ 

/ krCt Se, 

t 


ln(i) 

3cln(£) j 


1.25t \ 


^cln(t) 3cln(t) J 


(b) 

> 


_2_1.25 




3c 


) > ^A(/), 


where the inequality (a) is due to the fact that j{tA{f) — 
is an increasing function with respect 




yfckr ln(i) 3cln(t) 

to kr{kr > 1) and the inequality (b) due to the fact that for 
f > f* we have yjln{f) > 1/A(/). Thus, 


£n(/)l 


{fLil - 


c(lnf) 4c^(lnf) 


< 


tAPfY - tAifY 


for the 


and KYYs^f ^ { ^AUf )- Finally’ 

last term in Lemma 10, we have already get ht{f) > ^A{f) 
for t > t* as an intermediate step in the calculation of bound 
on At(/). Therefore, the last term is bounded in a order of 


o( 


A(/)^ 


). Use all these results together we obtain the results 


of the theorem. Note that the results holds for any rjt > Pt- 


C. Mixed Adversarial and Stochastic Regime 
Proof of Theorem 5. 

Proof: The proof of the regret performance in the mixed adversarial and stochastic regime is simply a combination of the performance of the AUFH-EXP3++$^{\text{AVG}}$ algorithm in the adversarial and stochastic regimes. It follows directly from Theorem 1 and Theorem 3. ■

Proof of Theorem 6. 

Proof: Similarly to the above, the proof follows directly from Theorem 2 and Theorem 3. ■


D. Contaminated Stochastic Regime 
Proof of Theorem 7. 

Proof: The key idea of proving the regret bound under 
moderately contaminated stochastic regime relies on how to 
estimate the performance loss by taking into account the 
contaminated pairs. Let denote the indicator functions 
of the occurrence of contamination at location {t, /), i.e., ^ 

takes value 1 if contamination occurs and 0 otherwise. Let 
mtif) = l*jit{f) + (1 - j)/r(/). If either base arm / was 

contaminated on round t then mt{f) is adversarially assigned 
a value of loss that is arbitrarily affected by some adversary, 
otherwise we use the expected loss. Let Mpf) = X]l=i knpf) 
then {Mt{f) - Mtif*)) “ (-^t(/) “ ^t(/*)) is a martingale. 
After T steps, for t > t, 

(Mtif) - Mtif*)) > tram{ltj,ltj,}m) - itif*)) 

-ff min{l - 1*^, 1 - l*^.}(p(/) - p(/*)) 

> -CtAif) + it- CtAif))Aif) > (1 - 2C)fA(/). 
Define the event $\mathcal{Z}_t^f$:


Wt-i ■ 


/- ~ \ - (3 + 

(l-2C)fA(/)- (l,(/) - Ltif*)) < 2v^+ ^ 3 /^ , 

where is defined in the proof of Theorem 3 and i/t = 


The 


Ss=i fe^- Then by Bernstein’s inequality P[^/] < e“^‘ 
remanning proof is identical to the proof of Theorem 3. 

For the regret performance in the moderately contaminated stochastic regime, according to our definition with the attacking strength $\zeta \in [0, 1/4]$, we only need to replace $\Delta(f)$ by $\Delta(f)/2$ in Theorem 5. ■


VII. The Computationally Efficient Implementation of the AUFH-EXP3++ Algorithm

The implementation of Algorithm 1 requires the computation of probability distributions over, and the storage of, $N$ strategies, which has time and space complexity $O(n^{k_r})$. As the number of channels increases, the strategy space becomes exponentially large, which scales poorly and results in low efficiency. To address this important problem, we propose a computationally efficient enhanced algorithm that uses dynamic programming techniques, as shown in Algorithm 2. The key idea of the enhanced algorithm is to select the receiving channels one by one until $k_r$ channels are chosen, instead of choosing a strategy from the large strategy space in each time slot.

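We use $\overline{S}(f, k)$ to denote the strategy set in which each strategy selects $k$ channels from $f, f+1, \ldots, n$. We also use $\underline{S}(f, k)$ to denote the strategy set in which each strategy selects $k$ channels from channels $1, 2, \ldots, f$. We define $\overline{W}_t(f, k) = \sum_{i \in \overline{S}(f,k)} \prod_{f' \in i} w_t(f')$ and $\underline{W}_t(f, k) = \sum_{i \in \underline{S}(f,k)} \prod_{f' \in i} w_t(f')$. Note that they have the following properties:

\[ \overline{W}_t(f, k) = \overline{W}_t(f+1, k) + w_t(f)\, \overline{W}_t(f+1, k-1), \tag{10} \]

\[ \underline{W}_t(f, k) = \underline{W}_t(f-1, k) + w_t(f)\, \underline{W}_t(f-1, k-1), \tag{11} \]

which implies that both $\overline{W}_t(f, k)$ and $\underline{W}_t(f, k)$ can be calculated in $O(k_r n)$ (letting $W_t(f, 0) = 1$ and $\overline{W}(n+1, k) = \underline{W}(0, k) = 0$) by using dynamic programming for all $1 \le f \le n$ and $1 \le k \le k_r$.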

In step 1, a strategy should be drawn from the $N$ strategies. Instead of drawing a strategy directly, we select the channels of the strategy one by one until a strategy is found. Here, we select channels one by one in increasing order of channel index, i.e., we determine whether channel 1 should be selected, then channel 2, and so on. For any channel $f$, if $k < k_r$ channels have been chosen among channels $1, \ldots, f-1$, we select channel $f$ with probability

\[ \frac{w_{t-1}(f)\, \overline{W}_{t-1}(f+1, k_r - k - 1)}{\overline{W}_{t-1}(f, k_r - k)} \tag{12} \]

and do not select $f$ with probability $\frac{\overline{W}_{t-1}(f+1, k_r - k)}{\overline{W}_{t-1}(f, k_r - k)}$. Let $w(f) = w_{t-1}(f)$ if channel $f$ is selected in the strategy $i$; $w(f) = 1$ otherwise. Then $w(f)$ is the contribution of $f$ to the strategy weight, and in our algorithm $w_{t-1}(i) = \prod_{f=1}^{n} w(f)$. Let $c(f) = 1$ if $f$ is selected in $i$; $c(f) = 0$ otherwise. The term $\sum_{f'=1}^{f} c(f')$ denotes the number of channels chosen among channels $1, 2, \ldots, f$ in strategy $i$. In this implementation, the probability that a strategy $i$ is selected is

\[ \prod_{f=1}^{n} \frac{w(f)\, \overline{W}_{t-1}\big(f+1,\, k_r - \sum_{f'=1}^{f} c(f')\big)}{\overline{W}_{t-1}\big(f,\, k_r - \sum_{f'=1}^{f-1} c(f')\big)} = \frac{\prod_{f=1}^{n} w(f)}{\overline{W}_{t-1}(1, k_r)} = \frac{w_{t-1}(i)}{W_{t-1}}. \]

This probability is equivalent to that in Algorithm 1, which implies that the implementation is correct. Because we do not maintain $w_t(i)$, it is impossible to compute $q_t(f)$ as we have described in Algorithm 1. Instead, $q_t(f)$ can be computed within $O(n k_r)$ as in Eq. (4) for each round.
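The dynamic program and the channel-by-channel draw can be sketched as follows. This is an illustrative reading, not the authors' code: channels are 0-indexed, the function names are hypothetical, the exploration mixing is left out so that only the exponential-weights part is sampled, and the marginal formula (combining a prefix and a suffix table) is one standard way to reach $O(n k_r)$ per channel sweep. A brute-force routine is included solely to check small cases.

```python
import itertools
import math
import random


def suffix_table(w, k_r):
    """S[f][k]: sum over k-subsets of channels f..n-1 of the product of their weights."""
    n = len(w)
    S = [[0.0] * (k_r + 1) for _ in range(n + 1)]
    for f in range(n + 1):
        S[f][0] = 1.0                          # the empty selection has weight 1
    for f in range(n - 1, -1, -1):             # recurrence (10), 0-indexed
        for k in range(1, k_r + 1):
            S[f][k] = S[f + 1][k] + w[f] * S[f + 1][k - 1]
    return S


def prefix_table(w, k_r):
    """P[f][k]: sum over k-subsets of channels 0..f-1 of the product of their weights."""
    n = len(w)
    P = [[0.0] * (k_r + 1) for _ in range(n + 1)]
    for f in range(n + 1):
        P[f][0] = 1.0
    for f in range(1, n + 1):                  # recurrence (11), 0-indexed
        for k in range(1, k_r + 1):
            P[f][k] = P[f - 1][k] + w[f - 1] * P[f - 1][k - 1]
    return P


def sample_strategy(w, k_r, rng):
    """Draw k_r channels one by one; a strategy i is drawn with probability w(i) / W."""
    S = suffix_table(w, k_r)
    chosen, need = [], k_r
    for f in range(len(w)):
        if need == 0:
            break
        take = w[f] * S[f + 1][need - 1] / S[f][need]   # Eq. (12)
        if rng.random() < take:
            chosen.append(f)
            need -= 1
    return tuple(chosen)


def channel_marginals(w, k_r):
    """Probability that each channel belongs to the drawn strategy (no exploration mixing)."""
    n = len(w)
    P, S = prefix_table(w, k_r), suffix_table(w, k_r)
    total = S[0][k_r]
    return [w[f] * sum(P[f][j] * S[f + 1][k_r - 1 - j] for j in range(k_r)) / total
            for f in range(n)]


def brute_force_marginals(w, k_r):
    """Reference computation by enumerating all strategies (small n only)."""
    strategies = list(itertools.combinations(range(len(w)), k_r))
    weights = [math.prod(w[f] for f in s) for s in strategies]
    total = sum(weights)
    return [sum(x for s, x in zip(strategies, weights) if f in s) / total
            for f in range(len(w))]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_strategy([0.5, 1.0, 2.0, 1.5, 0.7], 2, rng))
```

The tables are rebuilt in every call for clarity; an implementation that cares about speed would update them once per round after the weight update.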

Moreover, for the exploration parameters $\varepsilon_t(f)$, since there are $k_r$ parameters $\varepsilon_t(f)$ in the last term of Eq. (6) and there are $n$ channels, the storage complexity is $O(k_r n)$. Similarly, the time complexity for maintaining the exploration parameters $\varepsilon_t(f)$ is $O(k_r n t)$. Based on the above analysis, we summarize the conclusions in the following theorem.

Theorem 11. Algorithm 2 has time complexity $O(k_r n t)$ and space complexity $O(k_r n)$, which is linear in the number of rounds $t$ and in the parameters $k_r$ and $n$.


Algorithm 2 A Computationally Efficient Implementation of AUFH-EXP3++

Input: $n$, $k_r$, $t$; see text for the definitions of $\eta_t$ and $\xi_t(f)$.
Initialization: Set the initial channel weight $w_0(f) = 1$, $\forall f \in [1,n]$. Let $W_t(f, 0) = 1$ and $\overline{W}(n+1, k) = \underline{W}(0, k) = 0$, and compute $\overline{W}_0(f, k)$ and $\underline{W}_0(f, k)$ following Eqs. (10) and (11), respectively.

for time slot $t = 1, 2, \ldots$ do

1: The receiver selects the channels $f$, $\forall f \in [1,n]$, one by one according to the channel's probability distribution computed following Eq. (12), until a strategy with $k_r$ chosen channels is selected.

2: The receiver computes the probability $q_t(f)$, $\forall f \in [1,n]$, according to Eq. (6).

3: The receiver calculates the loss for channel $f$, $\ell_{t-1}(f)$, $\forall f \in I_t$, based on the received channel gain $g_{t-1}(f)$ by using $\ell_{t-1}(f) = 1 - g_{t-1}(f)$. Compute the estimated loss $\tilde{\ell}_t(f)$, $\forall f \in [1,n]$, as follows:

\[ \tilde{\ell}_t(f) = \begin{cases} \frac{\ell_t(f)}{q_t(f)} & \text{if channel } f \in I_t, \\ 0 & \text{otherwise.} \end{cases} \]

4: The receiver updates all channel weights as $w_t(f) = w_{t-1}(f)\, e^{-\eta_t \tilde{\ell}_t(f)}$, $\forall f \in [1,n]$, and computes $\overline{W}_t(f, k)$ and $\underline{W}_t(f, k)$ following Eqs. (10) and (11), respectively.

end for


Besides, because the channel selection probability $q_t(f)$ and the updated weights of Algorithm 2 are equal to those of Algorithm 1, all the performance results in Section V still hold for Algorithm 2.

VIII. Implementation Issues and Simulation 
Results 

In this section, we consider wireless communications from a transmitter to a receiver that is by default in the stochastic regime with Bernoulli distributions for rewards. W.l.o.g., we assume a constant unitary data packet rate from the transmitter for each channel $k_t \subset S_t$ over every time slot $t$, i.e., $M = 1$ packet, where $k_t = 4$. All experiments were conducted on an off-the-shelf desktop with dual 6-core Intel i7 CPUs clocked at 2.66 GHz. For all the suboptimal channels the rewards are Bernoulli with bias 0.5, and we set a single best channel whose reward is Bernoulli with bias $0.5 + \Delta$.

Fig. 2: Performance Comparison in the Stochastic Regime.
Fig. 3: Performance Comparison in the Contaminated Stochastic Regime.
Fig. 4: Performance Comparison in the Oblivious Adversarial Regime.
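The stochastic reward model described above is simple to reproduce. The sketch below is a hypothetical helper (the names `make_bernoulli_env` and `draw` are not from the paper): all channels have bias 0.5 except one best channel with bias $0.5 + \Delta$, and losses are obtained as one minus the gain.

```python
import random


def make_bernoulli_env(n, best, gap, rng):
    """Return (biases, draw), where draw() samples one slot of 0/1 channel gains.

    n    : number of channels
    best : index of the single best channel
    gap  : Delta, the bias advantage of the best channel (0 <= gap <= 0.5)
    rng  : a random.Random instance, so that runs are reproducible
    """
    biases = [0.5] * n
    biases[best] = 0.5 + gap

    def draw():
        return [1 if rng.random() < b else 0 for b in biases]

    return biases, draw


def gains_to_losses(gains):
    """Loss model used by the learner: loss = 1 - gain."""
    return [1 - g for g in gains]
```

With `gap = 0.2`, the expected loss of every suboptimal channel exceeds that of the best channel by exactly 0.2, which is the gap $\Delta$ used in the experiments.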

To show the advantages of our AUFH-EXP3++ algorithms, we compare their performance to other existing MAB-based algorithms, which include: the EXP3-based anti-jamming algorithm in [11], which we name "Anti-Jam-EXP3"; the combinatorial UCB-based algorithm "CombUCB1", with the almost tight regret bound proved in [29]; and the combinatorial version of the Thompson sampling algorithm [35]. We include the Thompson sampling algorithm in the comparison because of its empirically good performance, as indicated in [?]. We make ten repetitions of each experiment to reduce the performance bias. In Figs. 2-5, the solid lines represent the mean performance over the experiments and the dashed lines represent the mean plus one standard deviation (std) over the ten repetitions of the corresponding experiments. Note that, for a given optimal channel access strategy, small regret values indicate a large number of received data packets.

At first, we run our experiments with different numbers of available channels, $n = 8, 16, 60$. The number of receiving channels and the gap are always $k_r = 4$ and $\Delta = 0.2$, respectively. In our first set of experiments, shown in Fig. 2, we run each algorithm for 10^ rounds. We choose $(n, k_r)$ pairs equal to $(8,4)$, $(16,4)$, $(60,4)$ to see how our algorithms perform from a small channel access strategy set ($\binom{8}{4} = 70$) to a large channel access strategy set ($\binom{60}{4} = 487635$). The different versions of our AUFH-EXP3++ algorithm are parameterized by $\xi_t(f) = \frac{\ln(t\hat{\Delta}_t(f)^2)}{32 t \hat{\Delta}_t(f)^2}$, where $\hat{\Delta}_t(f)$ is the empirical estimate of $\Delta(f)$ defined in (4). To demonstrate that in the stochastic regime the exploration parameters are in full control of the performance, we run the AUFH-EXP3++ algorithm with two different learning rates: AUFH-EXP3++$^{\text{EMP}}$ corresponds to $\eta_t = \beta_t$ and AUFH-EXP3++$^{\text{ACC}}$ corresponds to $\eta_t = 1$. Note that only AUFH-EXP3++$^{\text{EMP}}$ has a performance guarantee in the adversarial regime. For our AUFH-EXP3++ algorithms, we transform the rewards into losses via $\ell_t(f) = 1 - g_t(f)$; the other algorithms operate directly on the rewards.

Fig. 5: Performance Comparison in the Adaptive Adversarial Regime.


From the results presented in Fig. 2, we see that in all the experiments the performance of AUFH-EXP3++$^{\text{EMP}}$ is almost identical to the performance of CombUCB1. This means that our algorithm can attain almost optimal transmission efficiency in stochastic environments, and that it scales well to large channel access strategy sets. Thus, AUFH-EXP3++$^{\text{EMP}}$ has all the advantages of the stochastic MAB algorithms and performs much better than Anti-Jam-EXP3 [11]. Moreover, unlike CombUCB1 and Thompson sampling, AUFH-EXP3++$^{\text{EMP}}$ is secured against a potential adversary during the wireless communications game. In addition, the AUFH-EXP3++$^{\text{ACC}}$ algorithm can be seen as a special teaser that shows the algorithm's performance under the condition $\eta_t \ge \beta_t$. It performs better than AUFH-EXP3++$^{\text{EMP}}$, but it does not have the adversarial regime performance guarantee.


In our second set of experiments, we simulate a moderately contaminated stochastic environment by drawing the first 2,500 rounds of the game according to one stochastic model and then switching the best channel and continuing the game until 8 * 10® rounds. This action can be regarded as an occasional jamming behavior. In this case, the contamination is not fully adversarial, but drawn from a different stochastic model. We run this experiment with $\Delta = 0.2$, $k_r = 2$ and $n = 4, 8, 16$ to observe the learning performance. The results are presented in Fig. 3. Although it is hard to see the first 2,500 rounds on the plot, their effect on all the algorithms is clearly visible. Despite the initial corrupted rounds, the AUFH-EXP3++$^{\text{EMP}}$ algorithm successfully returns to the stochastic operation mode and achieves better results than Anti-Jam-EXP3 [11].

To the best of our knowledge, it is very hard to simulate
the fully adversarial regime with an arbitrarily changing
oblivious jammer. In our third set of experiments, shown in
Fig. 4, we emulate the adversarial regime under oblivious
jamming attack by setting the Δ value of the best channel
randomly from [0.1, 0.3] and switching the best channel to a
different index in the channel set at every other time slot
using a pseudorandom sequence generator. The channel
rewards are determined before running the algorithm. The
reward sequences therefore still follow a certain stochastic
pattern, although a less obvious one. We set the typical
parameters k_r = 2, Δ = 0.2 and run all the algorithms up to
8 * 10® rounds. We find that our AUFH-EXP3++ algorithm
performs close to, and slightly better than, Anti-Jam-EXP3,
which agrees with our theoretical analysis.
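
As a rough sketch of how such an oblivious reward sequence might be
fixed in advance, the snippet below precomputes the whole reward
table before any algorithm runs. The Bernoulli reward model, the
base mean of 0.5, the redrawing of Δ at every switch, and the name
oblivious_reward_table are assumptions made for illustration only.

```python
import random


def oblivious_reward_table(n, horizon, gap_range=(0.1, 0.3),
                           base_mean=0.5, seed=1):
    """Precompute rewards for an emulated oblivious jammer.

    Assumptions (not from the paper): Bernoulli rewards, suboptimal
    channels have mean `base_mean`, and the gap of the best channel is
    redrawn uniformly from `gap_range` whenever the best channel moves.
    """
    rng = random.Random(seed)
    best = rng.randrange(n)
    gap = rng.uniform(*gap_range)
    table, best_seq = [], []
    for t in range(horizon):
        # Move the best channel to a different index at every other slot.
        if t > 0 and t % 2 == 0:
            best = (best + 1 + rng.randrange(n - 1)) % n
            gap = rng.uniform(*gap_range)
        row = []
        for i in range(n):
            mean = base_mean + gap if i == best else base_mean
            row.append(1 if rng.random() < mean else 0)
        table.append(row)
        best_seq.append(best)
    return table, best_seq
```

Because the table depends only on the seed and not on the learner's
actions, the same call always returns the same sequence, which is
what makes the emulated jammer oblivious.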

In our fourth set of experiments, shown in Fig. 5, we
simulate the adaptive jamming attack case in the adversarial
regime with a typical memory m = 80. We see large
performance degradation for all algorithms when compared
to the oblivious jammer case. AUFH-EXP3++ and
Anti-Jam-EXP3 still show almost the same regret
performance, and their large regrets indicate their
sensitivity to the adaptive jammer.

We also compared the per-round computing time of the two
versions of AUFH-EXP3++, Algorithm 1 and Algorithm 2, for
different (n, k_r) pairs. The results are listed in Table I.
From the results, we can see that Algorithm 2 scales linearly
with n and k_r and has a much lower computational cost than
Algorithm 1. In a typical practical multi-channel wireless
communication system with (n, k_r) = (64, 12), Algorithm 1
takes about 162 seconds to finish one round of calculation,
which is infeasible, while Algorithm 2 takes about 0.134
seconds, which is very reasonable in practical
implementation.
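
The comparison can be reproduced from the numbers in Table I with a
small script. The values below are copied from the table; they are
interpreted as milliseconds, which is an assumption chosen so that
the (64, 12) entries match the 162 seconds and 0.134 seconds quoted
above. The names TABLE_I_MS, seconds and speedup are hypothetical.

```python
# Per-round computation times copied from Table I, keyed by (n, k_r).
# Each value is (Algorithm 1, Algorithm 2). The unit is assumed to be
# milliseconds so that (64, 12) matches the 162 s and 0.134 s in the text.
TABLE_I_MS = {
    (12, 4): (23, 4),
    (24, 4): (167, 9),
    (48, 6): (699, 31),
    (48, 12): (2247, 57),
    (64, 6): (8375, 74),
    (64, 12): (162372, 134),
    (64, 24): (862961, 280),
}


def seconds(milliseconds):
    """Convert a table entry to seconds."""
    return milliseconds / 1000.0


def speedup(pair):
    """Ratio of Algorithm 1 time to Algorithm 2 time for one (n, k_r)."""
    alg1_ms, alg2_ms = TABLE_I_MS[pair]
    return alg1_ms / alg2_ms


if __name__ == "__main__":
    for pair, (alg1_ms, alg2_ms) in TABLE_I_MS.items():
        print(pair,
              f"Alg. 1: {seconds(alg1_ms):.3f} s",
              f"Alg. 2: {seconds(alg2_ms):.3f} s",
              f"speedup: {speedup(pair):.1f}x")
```

For (n, k_r) = (64, 12) the ratio exceeds three orders of magnitude,
and it keeps growing for (64, 24).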

For brevity, we do not plot the regret performance figures
for the mixed adversarial and stochastic regime. However, in
our last experiments, we compare the received data packet
rate (Mbps) for all four regimes after a relatively long
period of learning rounds t = 2 * 10^. Here we assume
M = 1 packet contains 1000 bits and each time slot is
just one second. We set k_r = 2 and Δ = 0.2 as fixed
Fig. 6: Received Data Packets Rate in Different Regimes. The legend below corresponds to all figures.


TABLE I: Computation Time Comparisons of Algorithm 1 and Algorithm 2
(computation time per round, in milliseconds)

(n, k_r)                   (12,4)  (24,4)  (48,6)  (48,12)  (64,6)  (64,12)  (64,24)
AUFH-EXP3++: Algorithm 1       23     167     699     2247    8375   162372   862961
AUFH-EXP3++: Algorithm 2        4       9      31       57      74      134      280


values for all sizes of the channel set n. We plot our
results in Fig. 6. Our algorithm AUFH-EXP3++ attains almost
all the advantages of the stochastic MAB algorithm CombUCB1
and has better throughput performance than Anti-Jam-EXP3.
Note that we also include the results of CombUCB1 [29] in
the oblivious adversarial, adaptive adversarial and
contaminated regimes, although the algorithm is not
applicable there in theory. This indicates that our proposed
algorithm AUFH-EXP3++ can be applied flexibly to general
unknown communication environments in different regimes.
Interestingly, we find that the Thompson sampling algorithm
performs best in all regimes, an empirical fact that has
also been observed in the machine learning community. We
believe it is a promising direction to study its theoretical
grounding from scratch for the collected (security)
non-i.i.d. data inputs.
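
The conversion from received packets to the rate in Mbps used in
this last experiment can be sketched as follows, using the stated
values of 1000 bits per packet and one second per time slot. The
function name received_rate_mbps and the packet counts in the usage
note are hypothetical and serve only as an illustration.

```python
def received_rate_mbps(packets_received, rounds,
                       bits_per_packet=1000, slot_seconds=1.0):
    """Average received data rate in Mbps over `rounds` time slots.

    Defaults follow the text: 1000 bits per packet, 1 second per slot.
    """
    if rounds <= 0:
        raise ValueError("rounds must be positive")
    total_bits = packets_received * bits_per_packet
    elapsed_seconds = rounds * slot_seconds
    # Bits per second, then scaled to megabits per second.
    return total_bits / elapsed_seconds / 1e6
```

For instance, a hypothetical run that receives 3,000 packets over
2,000 time slots would be reported as
`received_rate_mbps(3000, 2000)`, i.e. 1,500 bits per second
expressed in Mbps.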

IX. Conclusion and Future Works 

In this paper, we have proposed the first adaptive
multichannel-access algorithm for wireless communications
without knowledge about the nature of the environments.
First, we captured the features of general wireless
environments and divided them into four regimes, and then
provided solid theoretical analysis for each of them. Through
theoretical analysis, we found that almost optimal
performance is achievable in all regimes. Extensive
simulations were conducted to verify the learning performance
of our algorithm in different regimes and its much better
performance compared with classic approaches. The proposed
algorithm can be implemented efficiently in practical
wireless communication systems of different sizes. Our
framework is of general value: it can be extended by
incorporating a power control module based on estimated
gradient algorithms (under bandit feedback), by taking power
budgets into account, and by addressing access problems based
on observed side information (as a “contextual bandit”) for
wireless communication scenarios under unexpected security
attacks. The idea of this work could also be combined with
other online learning-based channel prediction algorithms to
perform joint optimal resource allocation with the
configuration of physical layer techniques, such as the MIMO
channel and its power allocations. We plan to extend our
proposed algorithms to general combinatorial settings and
expect that their variants can be applied in many tough
practical environments for wireless network monitoring,
secure routing problems, rumor propagation in social networks
(with a contextual bandit setting), etc.